Abstract

This memorandum presents the results of a phonetically motivated analysis of the speech recognition system developed as part of the ARM (Airborne Reconnaissance Mission) project. The aim of the work described here is to investigate to what extent errors can be explained by phonetic effects; those which cannot may indicate where the models could be improved. The background to the investigation and the problems of evaluating phoneme recognition performance are described first; the remainder of the report is concerned with a detailed analysis of specific types of errors, motivated by a desire to find phonetic explanations for them.
1. Introduction

This memorandum presents the results of a phonetically motivated analysis of the speech recognition system developed as part of the ARM (Airborne Reconnaissance Mission) project. The aim of the ARM project is accurate recognition of continuously spoken airborne reconnaissance reports using sub-word (phoneme) hidden Markov modelling techniques. The ARM system is described in [5]. The version of the system on which this study is based is speaker-dependent, has a vocabulary of 497 words, and scores an average of 86.8% word accuracy with word-level syntax (i.e. perplexity = 497).

The aim of the work described here is to investigate to what extent errors can be explained by phonetic effects; those which cannot may indicate where the models could be improved. For instance, if /p/ is misrecognised as /b/, this is understandable from the phonetic point of view, as the two are acoustically rather similar. However, if /p/ were to be consistently misrecognised as /U/ or /E/, this error would be difficult to explain in acoustic-phonetic terms, and would probably indicate that there is something wrong with the model(s).

The following section describes the background to the investigation and the problems of evaluating phoneme recognition performance. The remainder of the report is then concerned with a detailed analysis of specific types of errors, motivated by a desire to find phonetic explanations for them. The phonemic transcriptions in this report are in the SAM-PA notation [2, 8]; see Appendix A for the list of phonemes and examples.


2.  Background 


2.1  The  ARM  task 

The airborne reconnaissance mission reports which the ARM system recognises follow a standard format, beginning with some highly structured sentences recounting the mission details, such as the time and place of observation. Then follows a slightly more free-format section where the reconnaissance pilot describes what he sees and assesses its condition. The report concludes with a brief description of the weather and visibility conditions. The vocabulary of the system, with its citation-form phoneme transcriptions, can be found in Appendix B. An example of an ARM report is given below.

Recce report two stroke Charlie stroke six eight one. Military activity at map co-ordinates India hotel eight four three four, time over target eleven oh seven GMT. New target cat zero one; operational airstrip. Roughly fifteen light aircraft of type possibly foxbat. Main runways heading southwest wholly unusable. SAM defences intact. TARWI fife eighths at niner hundred; end of report.

It is important to note that this is not a natural use of language, and this may affect the generality of the results of this study, in that the relative frequency of phonemes in the ARM vocabulary will not necessarily match that in natural language. In particular, the phoneme /D/, which ranks eighth in normal use (due to the high frequency of the word "the" in natural speech), does not occur at all in the data of two of the speakers examined here, and occurs only once for Speaker 2, who pronounces the word "with" as /wID/ rather than /wIT/. A comparison of phoneme frequencies in normal speech [3] with those in the ARM data can be found in Appendix A.

2.2  The  Speakers 

The system currently recognises the speech of three speakers, and is trained separately for each, using approximately fifteen minutes of speech (airborne reconnaissance mission reports) from each of the three. Speakers 1 and 2 are male; Speaker 1 is basically RP, while Speaker 2 has Midlands overtones. Speaker 3 is female and has north-eastern colouring in her accent. Each speaker has their own dictionary to take account of dialectal variation. In this report I will be trying to draw some general conclusions about error types which apply to all three speakers, but the more important speaker differences will also be pointed out.
2.3 The System

The ARM system is described in detail in [7]. Sub-word (phoneme-like) hidden Markov models are used, but it is well known that the acoustic realisation of phonemes varies in different contexts. In order to take account of this context-sensitivity, approximately 1500 triphones are used. Triphone modelling assumes that it is the immediately surrounding context which exerts the most influence on the acoustic realisation of a particular phoneme, so a triphone is a model of a phoneme in its left and right context. In the current system this is restricted to word-internal contexts. See [6] for a full description of the triphone methods used in the ARM system.
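As a rough illustration of word-internal context expansion, the sketch below turns one word's citation-form phoneme string into context-dependent labels. The `left-centre+right` label notation (borrowed from HTK), the function name, and the choice to give word-initial and word-final phonemes only a one-sided context are assumptions made for illustration; this memo does not say how ARM labels its models or treats word edges.

```python
from typing import List

def word_internal_triphones(phones: List[str]) -> List[str]:
    """Label each phoneme of one word with its within-word neighbours."""
    labels = []
    for i, centre in enumerate(phones):
        # Contexts stop at the word boundary, so edge phonemes get one side only.
        left = phones[i - 1] + "-" if i > 0 else ""
        right = "+" + phones[i + 1] if i < len(phones) - 1 else ""
        labels.append(left + centre + right)
    return labels

print(word_internal_triphones(["s", "I", "k", "s"]))  # /sIks/ "six"
# ['s+I', 's-I+k', 'I-k+s', 'k-s']
```

Because no context crosses a word boundary, the same word always expands to the same labels, which is one reason a small vocabulary yields so few distinct triphones.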

In addition to the triphones for each context-sensitive phoneme, a number of short words are modelled explicitly at the word level. Non-speech sounds, such as breath noise or lip smacks, are also modelled explicitly with a set of single-state models. Both the word models and the noise models are treated in exactly the same way as the triphones.

For the purposes of the analysis described here, the system was configured as a phoneme recogniser, with no dictionary and no syntax. There is, however, some measure of constraint, in that the right context of each triphone must match the left context of the next. This is no small constraint: as the triphones are word-internal and the vocabulary so limited, the number of different triphones that actually occur is very small. (There are 1456 different triphones in the ARM set, while a 68000-word dictionary has 14378.)
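The interlocking requirement on consecutive models can be sketched as a simple check. This is one plausible formalisation, not the ARM decoder's code: within a word, a label `a-b+c` may only be followed by `b-c+...`, and a label with no right context (the end of a word) may only be followed by one with no left context (the start of the next word). The HTK-style label notation and the names `Triphone`, `parse` and `can_follow` are hypothetical.

```python
from typing import NamedTuple, Optional

class Triphone(NamedTuple):
    left: Optional[str]    # None at the start of a word
    centre: str
    right: Optional[str]   # None at the end of a word

def parse(label: str) -> Triphone:
    """Split a 'left-centre+right' label; either context may be absent."""
    left: Optional[str] = None
    right: Optional[str] = None
    if "-" in label:
        left, label = label.split("-", 1)
    if "+" in label:
        label, right = label.split("+", 1)
    return Triphone(left, label, right)

def can_follow(prev: Triphone, nxt: Triphone) -> bool:
    """True if model nxt may directly follow model prev."""
    if prev.right is None:
        # prev ends a word, so nxt must begin one (contexts never cross words)
        return nxt.left is None
    # Inside a word the contexts interlock: a-b+c must be followed by b-c+...
    return nxt.centre == prev.right and nxt.left == prev.centre

seq = [parse(x) for x in ["s+I", "s-I+k", "I-k+s", "k-s", "s+I"]]
print(all(can_follow(a, b) for a, b in zip(seq, seq[1:])))  # True
print(can_follow(parse("s-I+k"), parse("I-t+s")))            # False
```

Since only 1456 labels exist, most label pairs fail this test, so even without a dictionary the decoder is steered strongly towards phoneme sequences seen inside ARM words.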


2.4  Evaluating  the  performance 

The system has so far been tested on ten ARM reports (not in the training set) from each speaker, containing approximately 2290 phonemes per speaker, 6873 in all. The arrangement described above produced an average phoneme error rate of 26.2% for the three speakers. Phoneme recognition performance is measured by aligning the output of the system with a phonemic transcription of the test material. The latter is obtained by replacing each word in the orthographic transcription of the data with its phonemic transcription from the (speaker-dependent) dictionary. Errors are classified as substitutions, deletions or insertions. Substitutions occur when a phoneme is misrecognised as another phoneme, deletions when a phoneme has been missed by the system, and insertions when the system has recognised an extra phoneme. Recognition performance is stated in terms of correctness and accuracy. The first is simply the percentage of phonemes for which the system produced the same label as the dictionary transcription, while the second is a more stringent measure, calculated by subtracting the number of insertions from the number of correctly recognised phonemes (again as a percentage of the phonemes in the transcription); as such it is a more satisfactory indicator of recognition performance.
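The scoring definitions above can be made concrete with a minimal sketch: a minimum-edit-distance alignment between the dictionary transcription and the recogniser output, followed by correctness = H/N and accuracy = (H - I)/N, where N is the number of phonemes in the transcription, H the number correctly recognised and I the number of insertions. Unit costs for all three error types are an assumption; the memo does not describe the weights used by the ARM alignment tool, and the automatic alignments were in any case hand-corrected for this study.

```python
from typing import List, Tuple

def align_counts(ref: List[str], hyp: List[str]) -> Tuple[int, int, int, int]:
    """Minimum-edit-distance alignment; returns (hits, subs, dels, ins)."""
    n, m = len(ref), len(hyp)
    # cost[i][j] = edit distance between ref[:i] and hyp[:j] (unit costs)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            cost[i][j] = min(diag, cost[i - 1][j] + 1, cost[i][j - 1] + 1)
    # Trace back one optimal path, counting each kind of outcome.
    h = s = d = ins = 0
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            if ref[i - 1] == hyp[j - 1]:
                h += 1
            else:
                s += 1
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            d += 1          # reference phoneme missed
            i -= 1
        else:
            ins += 1        # extra phoneme in the output
            j -= 1
    return h, s, d, ins

def score(ref: List[str], hyp: List[str]) -> Tuple[float, float]:
    """Return (% correct, % accuracy) relative to the reference length."""
    h, s, d, ins = align_counts(ref, hyp)
    return 100.0 * h / len(ref), 100.0 * (h - ins) / len(ref)

# "six six" with only one /s/ recognised across the word boundary
print(score(list("sIkssIks"), list("sIksIks")))  # (87.5, 87.5)
```

The example anticipates the "six six" case discussed below: the single missing /s/ is scored as a deletion even though the speaker probably produced only one.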

The alignment of recognition results and transcription is automatic, and a summary of individual phoneme performance is also produced, along with a confusion matrix. However, this process is not fully accurate, in that errors in the alignment sometimes obscure correct matches, insertions are counted as substitutions, and so on. If alignment errors are taken into account the overall results are not significantly different (on average slightly over 1% either way), but as it is the distribution and details of the error types that are most important for this investigation, it is necessary to hand-correct the alignment and scoring. This is quite a lengthy process, as it involves listening to the speech while observing the labelling produced by the system on the spectrograms, and then re-compiling the phoneme statistics and confusion matrix. All the analyses in this paper are based on hand-corrected alignments.

It is in practice extremely difficult to assess performance, as in many cases the speaker may not actually produce the somewhat idealised pronunciation represented in the dictionary. For example, in the sequence "six six" the speaker is likely to produce only one /s/ (though it may be somewhat lengthened) for the two which phonemically occur over the word boundary. In this example, if the system recognises only one /s/, it is penalised for having deleted a phoneme. There are numerous examples of this nature, and these will be discussed under the appropriate categories below. In spite of these shortcomings, results are scored strictly against the dictionary transcription in order to ensure that the evaluation is both consistent and automatic. We are, however, currently investigating the inclusion of alternative transcriptions in the dictionary, which will allow us to take account of many of these so-called errors.
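One way alternative transcriptions could be used in scoring is sketched below: each permitted pronunciation is aligned with the recogniser output and the closest one is kept as the reference. The function names, the utterance-level choice and the "merged /s/" variant are hypothetical illustrations, not a description of the dictionary format under investigation; in practice variants would more likely be listed per word and combined.

```python
from typing import List, Sequence

def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance with unit costs, computed row by row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j - 1] + (r != h),  # match or substitution
                           prev[j] + 1,             # deletion
                           cur[j - 1] + 1))         # insertion
        prev = cur
    return prev[-1]

def best_reference(alternatives: List[List[str]], hyp: List[str]) -> List[str]:
    """Choose the permitted transcription closest to the recogniser output."""
    return min(alternatives, key=lambda ref: edit_distance(ref, hyp))

full = ["s", "I", "k", "s", "s", "I", "k", "s"]   # dictionary form of "six six"
merged = ["s", "I", "k", "s", "I", "k", "s"]      # hypothetical variant, one /s/
hyp = ["s", "I", "k", "s", "I", "k", "s"]
print(edit_distance(full, hyp))                       # 1
print(best_reference([full, merged], hyp) == merged)  # True
```

Choosing the closest variant can only lower the error count, so a real scorer would need to restrict variants to phonetically justified ones to avoid hiding genuine errors.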


3.  The  Analysis 

In this section the phoneme recognition results are analysed in some detail. A summary of the phoneme recognition results for each speaker and for all speakers combined is shown in Table 1. From this it can be seen that the results for all three speakers are in the same range, although Speakers 2 and 3 perform slightly better than Speaker 1. This general trend is evident in most of the more detailed analyses of phoneme performance; particular differences between speakers will be pointed out below.


Speaker         % correct   % substitution   % deletion   % accuracy   Total no. of phonemes
1                    75.5             13.2         11.3         68.8                    2290
2                    82.0             10.9          7.1         76.8                    2290
3                    80.8             11.6          7.6         76.0                    2293
All speakers         79.4             11.9          8.7         73.8                    6873

Table 1  Summary of phoneme recognition results
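The figures in Table 1 can be cross-checked with a few lines of arithmetic. Since the correct, substitution and deletion percentages are all taken over the phonemes in the transcription, they should sum to 100; the insertion rate is the difference between correctness and accuracy; and the total error rate is substitutions + deletions + insertions, i.e. 100 minus accuracy. For all speakers this gives the 26.2% error rate quoted in Section 2.4, and deletions come to about 42% of substitutions plus deletions, which appears to be how the 42% figure in Section 3.3.1 was obtained. The dictionary below simply restates Table 1; the variable names are illustrative.

```python
# Table 1 values: (% correct, % substitution, % deletion, % accuracy)
table1 = {
    "Speaker 1": (75.5, 13.2, 11.3, 68.8),
    "Speaker 2": (82.0, 10.9, 7.1, 76.8),
    "Speaker 3": (80.8, 11.6, 7.6, 76.0),
    "All speakers": (79.4, 11.9, 8.7, 73.8),
}

for name, (cor, sub, dele, acc) in table1.items():
    ins = cor - acc                     # insertions, % of transcription phonemes
    errors = sub + dele + ins           # equals 100 - accuracy
    del_share = 100 * dele / (sub + dele)
    print(f"{name}: C+S+D = {cor + sub + dele:.1f}, insertions = {ins:.1f}, "
          f"errors = {errors:.1f}, deletions/(S+D) = {del_share:.0f}%")
# last line: All speakers: C+S+D = 100.0, insertions = 5.6, errors = 26.2, deletions/(S+D) = 42%
```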


The rest of Section 3 is devoted to a discussion of the different types of error. A complete set of tables showing the individual phoneme performance for each speaker and for all speakers combined can be found in Appendix C (Tables C1-C4). For convenience and clarity, only the information relevant to the factors under discussion is presented in the following sections.


3.1  Analysis  of  Correctness/accuracy 

3.1.1 Individual phonemes

Table 2 shows the phoneme correct and accuracy scores for each speaker and for all speakers. A number of phonemes (/D/, /OI/, /3/, /e@/ and /U/) occur so rarely in the ARM reports that their results are unreliable indicators of performance, so these will be ignored in this analysis. The models with the poorest performance were those for whole words, which tended to be confused with one or more phonemes. Although there are slight differences between speakers at the top end of performance, it can be seen that in general /A/ was recognised most reliably, closely followed by /eI/, /S/ and /O/.

It is at the lower end, however, that more obvious differences emerge. None of the speakers has good performance for /N/, for instance, and only Speaker 2 has a reasonable score for /v/. In Speaker 1's data, /p/, although well recognised, suffered from some insertions, and was one of the least accurately recognised phonemes. For this speaker too /m/ scored particularly badly, but /d/ was the least accurate, due to an unusually large number of insertions (these will be discussed later). It was /V/ that was least correctly recognised in Speaker 2's case, and this phoneme was also relatively often inserted, making it the least accurate phoneme for this speaker. This may be due to there being relatively few occurrences of this phoneme in Speaker 2's data (18, as opposed to 35 in each of the other speakers' reports). In Speaker 3's case the least correctly recognised phoneme (apart from /v/ and /N/, which were common to the other speakers) was /b/.

In trying to find general trends in phoneme recognition performance, the phonemes were classified into phonetically motivated groups, namely 'manner' and 'place' of articulation. (I have disregarded the word-level models in this classification.) Under 'manner' there is a broad classification into vowels and consonants, which should be self-evident, and a finer one where consonants are split into more specific classes. A list of the members of each of these classes is given in Figure 1, along with the total number of phonemes in each class. The fineness of the place classification was chosen in an attempt to make sure that there were enough members of a given class to give a reasonable sample. In the case of the centring diphthongs and palatal-alveolars there are probably too few, but it would not have been reasonable to include them in any of the other classes. It would therefore be unwise to draw any conclusions about these two classes. It may be useful to note that in general these classes of phonemes occur comparatively rarely, both in normal speech and in the ARM test data (see Appendix A). Thus /e@/ and /I@/ rank 40th and 41st in normal speech, and 40th and 30th in ARM. Among the palatal-alveolars, /S/ ranks 31st in normal speech and 27th in ARM; /tS/ 38th (37th); /dZ/ 36th in both; /j/ ranks 32nd in normal speech and 33rd in ARM; and /Z/ does not occur at all in ARM, and ranks 43rd in normal speech.


           Speaker 1      Speaker 2      Speaker 3         All
Phoneme    %Cor   %Acc   %Cor   %Acc   %Cor   %Acc   %Cor   %Acc  Total
s          83.8   75.7   86.8   80.9   86.8   83.8   85.8   80.3    408
z          86.0   70.2   71.9   50.8   82.5   77.2   80.1   65.3    171
S          83.9   83.9  100.0  100.0   96.8   96.8   93.5   93.5     93
f          89.8   85.9   89.8   83.3   83.4   80.8   87.6   83.3    234
v          36.7   20.0   83.3   70.0   50.0   36.7   56.7   42.2     90
T          83.3   71.2   65.7   54.3   75.0   69.4   74.8   65.3    107
D             -      -    0.0    0.0      -      -    0.0    0.0      1
h          61.5   61.5   76.9   76.9   61.5   46.2   66.7   61.6     39
tS         66.7   66.7   88.9   88.9   88.9   88.9   81.5   81.5     27
dZ         72.7   72.7   81.8   63.6   63.6   63.6   72.7   66.6     33
p          92.7   58.5   90.2   61.0   82.9   58.5   88.6   59.3    123
b          60.0   46.7   60.0   60.0   53.4   26.7   57.9   44.4     45
t          85.3   81.8   85.3   83.1   87.0   82.7   85.9   82.6    693
d          59.8   19.5   63.4   52.4   70.7   59.8   64.6   43.9    246
k          89.7   84.5   92.7   88.7   90.7   89.7   91.1   87.6    291
g          75.6   75.6   78.0   78.0   87.8   87.8   80.5   80.5    123
m          38.8   34.7   85.7   71.4   91.8   85.7   72.2   64.0    147
n          55.0   54.4   80.1   73.7   77.2   76.0   70.8   66.7    513
N          55.6   55.6   50.0   44.4   50.0   50.0   51.9   50.0     54
l          78.7   70.7   86.7   81.3   76.0   66.7   80.5   72.9    225
r          88.4   84.5   96.1   96.1   88.4   85.4   90.9   88.7    309
w          79.6   79.6   86.4   81.8   90.9   86.4   85.6   82.6    132
j         100.0   85.7   71.4   71.4   85.7   85.7   85.7   80.9     42
i          84.1   77.3   84.1   80.6   93.7   93.7   86.1   82.1    224
I          65.5   53.8   70.9   70.1   76.3   66.9   71.1   64.0    394
E          74.7   71.3   84.3   83.1   85.6   81.5      -   78.7    249
{          90.2   88.2   89.5   84.2   86.0      -   88.3   86.1    165
A          97.6   97.6   97.1   94.3      -   91.4   97.3   95.3    111
Q          76.9   76.9   96.7   96.7   84.6   84.6   89.3   89.3     56
O          81.1   78.4   94.6   94.6  100.0  100.0   91.9   91.0    111
U         100.0  100.0   66.7   33.3   66.7   66.7   77.8   66.7      9
u          72.7   67.3   83.6   81.8   69.1   67.3   75.2   72.1    165
3         100.0   66.7  100.0  100.0  100.0   66.7  100.0   77.8      9
@          56.0   50.0   66.0   58.7   64.7   51.3   62.2   53.3    450
V          82.8   77.1   50.0   27.7   74.3   68.6   72.7   63.6     88
eI         89.8   89.8  100.0  100.0   91.8   91.8   93.9   93.9    147
aI         86.3   86.3   94.1   90.2   94.1   94.1   91.5   90.2    153
OI        100.0  100.0  100.0  100.0  100.0  100.0  100.0  100.0      3
aU         87.5   87.5   87.5   87.5   93.8   87.5   89.6   87.5     48
@U         71.4   66.1   82.1   80.4   76.8   76.8   76.8   74.4    168
I@         94.1   94.1  100.0   94.1   82.4   82.4   92.2   90.2     51
e@         50.0   50.0   50.0   50.0   50.0      -      -      -      6
<u>        19.0   19.0   38.1   38.1   42.9   42.9   33.3   33.3     63
<oh>       50.0   50.0   33.3   33.3   50.0   50.0   44.4   44.4     18
<of>       45.5   45.5   27.3   27.3   45.5   45.5   39.4   39.4     33
<or>       50.0   50.0   50.0   50.0    0.0    0.0   33.3   33.3      6
Overall    75.5   68.8   82.0   76.8   80.8   76.0   79.4   73.8   6873

Table 2  Individual phoneme correct/accuracy for each and all speakers


3.1.2  Manner  of  articulation 

There is no significant difference in recognition performance between vowels and consonants, with vowel correctness 79.5% (n = 2607) and consonant correctness 80.6% (n = 4146). However, consonants are inserted more than twice as often as vowels (267 insertions compared with 119), making the accuracy for the consonants slightly lower: consonants 74.2%, vowels 75.0%.
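The insertion counts quoted above are easier to compare once they are expressed per reference phoneme. The short calculation below uses only the figures in this paragraph (variable names are illustrative). It shows that consonants produce about 2.2 times as many insertions as vowels in absolute terms but only about 1.4 times as many per phoneme, and that subtracting the consonant insertion rate from consonant correctness reproduces the 74.2% accuracy.

```python
# Counts quoted in Section 3.1.2
n_vowels, n_consonants = 2607, 4146
ins_vowels, ins_consonants = 119, 267

rate_v = 100 * ins_vowels / n_vowels          # insertions per 100 reference vowels
rate_c = 100 * ins_consonants / n_consonants  # insertions per 100 reference consonants
print(f"absolute ratio {ins_consonants / ins_vowels:.2f}")   # 2.24
print(f"per-phoneme rates: consonants {rate_c:.1f}%, vowels {rate_v:.1f}% "
      f"(ratio {rate_c / rate_v:.2f})")                       # 6.4%, 4.6% (1.41)
print(f"consonant accuracy = 80.6 - {rate_c:.1f} = {80.6 - rate_c:.1f}%")  # 74.2%
```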
Again, the full set of results for all speakers can be found in Appendix C (Tables C5-C8); only the information relevant to the current discussion will be presented here. The results analysed in terms of phoneme correctness/accuracy by manner of articulation are shown in Figure 2. The overall manner class accuracy was 87.1%.


MANNER
Plosive             p b t d k g                                      1521
Affricate           tS dZ                                              60
Strong fricative    s z S                                             672
Weak fricative      f v T D h                                         471
Liquid/Glide        l r w j                                           708
Nasal               n m N                                             714
Vowel               i I E { A Q O u U V @ 3 eI aI OI aU @U I@ e@     2607

PLACE
Labial              p b m f v T D w                                   879
Alveolar            t d n s z l r                                    2565
Palatal-alveolar    S tS dZ j                                         195
Velar               k g N h                                           507
Front               i I E {                                          1032
Central             V @ 3                                             547
Back                A O Q U u                                         452
Fronting            aI eI OI                                          303
Centring            I@ e@                                              57
Backing             aU @U                                             216

Figure 1  Key of manner and place class membership
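The class sizes in Figure 1 are simply sums of the per-phoneme totals in Table 2. The sketch below encodes the consonant part of the manner classification as given in Figure 1 and recomputes the class totals, which also reproduces the consonant count n = 4146 used in Section 3.1.2. The vowel and place classes could be handled in exactly the same way; the dictionary layout is an illustrative choice.

```python
# All-speaker phoneme totals from Table 2 (consonants only)
totals = {"p": 123, "b": 45, "t": 693, "d": 246, "k": 291, "g": 123,
          "tS": 27, "dZ": 33, "s": 408, "z": 171, "S": 93,
          "f": 234, "v": 90, "T": 107, "D": 1, "h": 39,
          "l": 225, "r": 309, "w": 132, "j": 42, "n": 513, "m": 147, "N": 54}

# Consonant manner classes as listed in Figure 1
manner = {"Plosive": "p b t d k g", "Affricate": "tS dZ",
          "Strong fricative": "s z S", "Weak fricative": "f v T D h",
          "Liquid/Glide": "l r w j", "Nasal": "n m N"}

class_totals = {cls: sum(totals[p] for p in members.split())
                for cls, members in manner.items()}
print(class_totals)
# {'Plosive': 1521, 'Affricate': 60, 'Strong fricative': 672,
#  'Weak fricative': 471, 'Liquid/Glide': 708, 'Nasal': 714}
print(sum(class_totals.values()))  # 4146
```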

Liquids/glides and strong fricatives were recognised most correctly and accurately for all speakers. Nasals were quite clearly the worst, especially for Speaker 1, though the accuracy of weak fricatives was also poor because of the high number of insertions. Both of these classes may be acoustically weak, and /v/ especially is easily missed, which might explain their poor performance. It is not surprising that strong fricatives should be well modelled, as they are generally acoustically prominent (especially compared with weak fricatives). More unexpected was the good performance of liquids and glides, which are often thought to be problematic for systems with limited ability to model temporal dynamics. The explanation for this may be provided by the variable frame rate analysis which is used [4]: areas which are acoustically stable are compressed into a smaller number of frames/states, while those that vary rapidly, such as /r/, /j/ and /w/, are modelled using comparatively more states, giving the improved time resolution needed to identify these sounds.
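The variable frame rate idea can be illustrated with one simple selection rule: keep a frame only when it differs from the last kept frame by more than a threshold. This is a generic sketch of the principle, not the analysis of [4]; the Euclidean distance, the threshold and the function name `vfr_keep` are assumptions. With toy one-dimensional "features", a steady stretch collapses to a single frame while a rapid transition, standing in for a glide such as /w/, keeps every frame.

```python
import math
from typing import List, Sequence

def vfr_keep(frames: Sequence[Sequence[float]], threshold: float) -> List[int]:
    """Indices of frames retained by a distance-threshold variable frame rate rule."""
    if not frames:
        return []
    kept = [0]
    for i in range(1, len(frames)):
        # Euclidean distance from the most recently retained frame
        if math.dist(frames[i], frames[kept[-1]]) > threshold:
            kept.append(i)
    return kept

# 10 steady frames, a 5-frame transition, then 10 steady frames
frames = [(0.0,)] * 10 + [(float(x),) for x in range(1, 6)] + [(5.0,)] * 10
print(vfr_keep(frames, threshold=0.5))  # [0, 10, 11, 12, 13, 14]
```

Of the 25 input frames only 6 survive, and 5 of those come from the transition, which is the redistribution of time resolution the paragraph above appeals to.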


Figure 2  Graph of manner class correct/accuracy (plosive, affricate, strong fricative, weak fricative, liquid/glide, nasal, vowel)




3.1.3 Place of articulation


Figure 3 gives the analysis of the results grouped by place of articulation (see also Appendix C, Tables C9-C12). The overall place class accuracy was 84.4%. From the graph it can be seen that diphthongs which move towards a front position are most accurately recognised, while among the consonants, palatal-alveolars are the best recognised. Perhaps not surprisingly, central vowels were poorly dealt with. The /@/ vowel represents a large proportion (over 80%) of the central vowels, and as this vowel is unstressed and notoriously variable, it is not surprising that it is rather loosely modelled, and is not only easily confusable but frequently inserted too. Labial consonants are only moderately well modelled, perhaps because most of the weak fricatives are in this group, and these are often acoustically indistinct. These results are strikingly consistent across speakers.


Figure 3  Graph of place class correct/accuracy (labial, alveolar, palatal-alveolar, velar, front, central, back, fronting, backing, centring)


3.2  Substitutions 

3.2.1  Individual  phonemes 

The substitution rates for individual phonemes for each and all speakers are presented in Table 3. The function words were quite clearly much substituted, and of the phonemes, /N/, /V/ and /@U/ are most likely to be substituted. /S/, /A/ and /r/ are least confused (ignoring those phonemes mentioned earlier that occur only a few times).

When the system misrecognises one phoneme as another, it is important to be able to explain why this has happened. If the two phonemes involved differ minimally, in one phonetic feature (/p/ and /b/, for instance), then it may be difficult to improve either model to separate them. If, however, larger differences are involved, there may be more scope for better modelling. In order to investigate what proportion of the substitution errors were phonetically predictable, phoneme confusion matrices were constructed; these can be found in Appendix C (Tables C13-C16). In general, confusions are with phonetically similar sounds, though there are some exceptions which are difficult to explain, even when there appears to be some pattern to them. For example, all four of Speaker 1's /v/-/@/ confusions occurred in the word "seven", but there were as many occasions when the /v/ in this word was correctly recognised, so it is not possible to make any generalisations about the cause of this error.
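A confusion matrix of the kind referred to here can be accumulated directly from a hand-corrected alignment, treated as a list of (spoken, recognised) pairs. In the sketch below, which is illustrative rather than the tool actually used, a missing recognised label marks a deletion and a missing spoken label an insertion; both are filed under a "-" pseudo-phoneme so that each row still accounts for every spoken token. The example pairs are invented.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Optional, Tuple

Pair = Tuple[Optional[str], Optional[str]]

def confusion_matrix(pairs: Iterable[Pair]) -> Dict[str, Counter]:
    """Rows are spoken phonemes, columns recognised phonemes ('-' = none)."""
    matrix: Dict[str, Counter] = defaultdict(Counter)
    for spoken, recognised in pairs:
        matrix[spoken or "-"][recognised or "-"] += 1
    return matrix

def row_percentages(row: Counter) -> Dict[str, float]:
    """Express one row as percentages of the spoken tokens in that row."""
    total = sum(row.values())
    return {label: round(100.0 * n / total, 1) for label, n in row.items()}

# Invented example: /v/ once correct, once as /@/, once deleted; one /@/ inserted
pairs = [("v", "v"), ("v", "@"), ("v", None), ("p", "b"), ("p", "p"), (None, "@")]
m = confusion_matrix(pairs)
print(row_percentages(m["v"]))  # {'v': 33.3, '@': 33.3, '-': 33.3}
```

Keeping the raw counts rather than percentages makes it easy to pool speakers or to collapse phonemes into the manner and place classes of Figure 1 afterwards.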
Phoneme   Speaker 1   Speaker 2   Speaker 3         All   Total
V               14.3        33.3        17.1        21.6      88
t                5.6         6.9        10.4         7.6     693
eI              10.2         0.0         8.2         6.1     147
d               14.6        17.1        15.9        15.9     246
aI               5.9         5.9         5.9         5.9     153
m               32.6        10.2         4.1        15.6     147
@U              23.2        14.3        21.4        19.6     168
n               25.1         8.8        11.3        15.4     513
I@               5.9         0.0        17.6         7.8      51
N               27.8        22.2        33.3        27.8      54
e@              50.0        50.0        50.0        50.0       6
l                5.3         9.3        10.7         8.4     225
<u>             81.0        57.1        57.1        65.1      63
r                1.9         1.0         5.8         2.9     309
<oh>            50.0        66.7        50.0        55.6      18
w               15.9         9.1         6.8        10.6     132
<of>            54.5        72.7        36.4        54.5      33
j                0.0        14.3        14.3         9.5      42
<or>            50.0        50.0       100.0        66.7       6

Table 3  Phoneme substitutions (%) for each and all speakers


3.2.2  Manner  of  articulation 

There  is  no  evidence  that  either  vowels  or  consonants  are  more  subject  to  substitution.  Figure  4  shows  the  rate 
of  substitutions  for  manner  of  articulation. 


Figure  4  Graph  of  manner  class  substitutions 

Confusions with phonemes from the same class would be more explicable than those with a different one, although there is a hierarchy of class similarities. For instance, plosives are more like affricates than vowels; nasals are more like vowels than they are like strong fricatives. In general, this is what we find in the ARM results: consonants are recognised as consonants 93% of the time, and vowels as vowels nearly 90% of the time. The results of the finer manner class analysis of confusions for all speakers are shown in Table 4 (those for individual speakers can be found, as usual, in Appendix C, Tables C17-C20). This matrix shows how often phonemes from one class were recognised as phonemes from other classes. The matrix diagonal shows within-class recognitions.
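Collapsing phoneme confusions into a manner-class matrix amounts to relabelling each spoken and recognised phoneme with its class and renormalising each row. In the sketch below, the names are hypothetical, only a few phonemes from Figure 1 are mapped, and the counts are invented. Deletions count towards a row's total but have no column, which is one reason the rows of such a matrix can sum to less than 100%; whether Table 4 was normalised this way is an assumption.

```python
from collections import Counter
from typing import Dict, Optional, Tuple

# A few phonemes with their manner classes from Figure 1 (not the full set)
MANNER = {**dict.fromkeys(["p", "b", "t", "d", "k", "g"], "Plo"),
          **dict.fromkeys(["n", "m", "N"], "Nas"),
          **dict.fromkeys(["@", "I", "i"], "Vow")}

def class_confusions(counts: Dict[Tuple[str, Optional[str]], int]) -> Dict[str, Dict[str, float]]:
    """Row percentages of recognised class for each spoken class."""
    totals: Counter = Counter()
    cells: Dict[str, Counter] = {}
    for (spoken, recognised), n in counts.items():
        cls = MANNER[spoken]
        totals[cls] += n
        if recognised is not None:  # a deletion adds to the row total only
            cells.setdefault(cls, Counter())[MANNER[recognised]] += n
    return {cls: {rec: round(100.0 * n / totals[cls], 1)
                  for rec, n in cells.get(cls, Counter()).items()}
            for cls in totals}

counts = {("n", "n"): 16, ("n", "d"): 2, ("n", "@"): 1, ("n", None): 1,
          ("p", "p"): 18, ("p", "@"): 1, ("p", None): 1}
print(class_confusions(counts))
# {'Nas': {'Nas': 80.0, 'Plo': 10.0, 'Vow': 5.0}, 'Plo': {'Plo': 90.0, 'Vow': 5.0}}
```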

Nasals were the most confused, though most of the confusions are predictable: nasals share stop-like characteristics with plosives, and a vowel-like structure with liquids and vowels. It is interesting that almost all (95%) of the nasal/plosive confusions were for Speaker 1, where nasals were mostly misrecognised as /d/ or another plosive. Plosives were misrecognised most often as vowels. Nearly half of these unexpected confusions are with central vowels, indicating that /@/ is a major culprit in misrecognition (as well as being misrecognised itself). In general, plosives are the most often substituted class.


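Rows give the spoken manner class; entries are the percentages recognised as each class, in the column order plosive, affricate, strong fricative, weak fricative, liquid/glide, nasal, vowel, with blank cells not shown and the within-class (diagonal) value in brackets.

Plosive            [87.4]  0.3  0.5  0.7  0.6  0.3  1.0     (total 137)
Affricate            3.3  [81.7]  6.7  1.7  1.7             (total 11)
Strong fricative     1.5  0.4  [90.8]  0.3  0.7  0.3        (total 574)
Weak fricative       5.3  0.2  0.4  [80.0]  1.3             (total 362)
Liquid/Glide         0.7  0.1  0.6  [87.3]  0.4  3.7        (total 611)
Nasal                2.2  0.3  1.8  [77.2]  3.8             (total 497)
Vowel                0.6  0.1  0.2  0.6  0.5  [89.9]        (total 335)

Table 4  Confusion matrix for manner of articulation - all speakers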


The rest of the matrix is very much as one would expect. In general, in-class recognition is good. Affricates are confused with plosives and strong fricatives, with which they share many features. Weak fricatives are also confused with plosives, one particular confusion being between a weak fricative and a plosive that share place of articulation (broadly speaking, labial), so this is not unexpected.

3.2.3 Place of articulation

The substitution rates for place of articulation are shown in Figure 5. As might be expected, central vowels are the weakest: they are confused with a wide range of different classes, and are the most widely substituted class too. Backing diphthongs, and for Speaker 3 centring diphthongs, are also frequently confused. It can be seen from the place confusion matrix in Table 5 that much of the poor recognition of labials is likely to be due to their being confused with each other, with alveolars being the most likely substitute.


Figure 5  Graph of place class substitutions (labial, alveolar, palatal-alveolar, velar, front, central, back, fronting, backing, centring)


3.2.4 Contextual Effects

Substitution errors can sometimes be explained by normal co-articulatory processes. Examples such as /n/ recognised as /N/ in "machine gun", /m/ as /n/ before an alveolar in "platforms", /g/ as /d/ and /d/ as /n/ in the sequence "target grid ref", /s/ as /z/ in the voiced environment "zero seven", and the sequence /st/ recognised as /zd/ in the voiced environment "fuel station" are not hard to find. A more detailed description of these errors is contained in [1]. These examples account for 10% of substitutions.
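Co-articulatory errors of this kind follow a recognisable pattern, so a first-pass screen for them could be automated. The sketch below is a hypothetical helper, not part of the ARM tools. It flags a substitution as a possible place assimilation when the manner class is unchanged and the recognised phoneme takes on the place of the following phoneme, using the consonant place and manner classes of Figure 1 (including its broad grouping of /T D/ with the labials and /h/ with the velars). A separate check flags voiceless-to-voiced substitutions; deciding whether the environment is actually voiced is left to the caller.

```python
from typing import Optional

# Consonant place and manner classes as given in Figure 1
PLACE = {**dict.fromkeys("p b m f v T D w".split(), "labial"),
         **dict.fromkeys("t d n s z l r".split(), "alveolar"),
         **dict.fromkeys("S tS dZ j".split(), "palatal-alveolar"),
         **dict.fromkeys("k g N h".split(), "velar")}
MANNER = {**dict.fromkeys("p b t d k g".split(), "plosive"),
          **dict.fromkeys("tS dZ".split(), "affricate"),
          **dict.fromkeys("s z S".split(), "strong fricative"),
          **dict.fromkeys("f v T D h".split(), "weak fricative"),
          **dict.fromkeys("l r w j".split(), "liquid/glide"),
          **dict.fromkeys("n m N".split(), "nasal")}
# Voiceless/voiced pairs (only those needed for the examples above)
VOICING = {("p", "b"), ("t", "d"), ("k", "g"), ("f", "v"), ("s", "z")}

def place_assimilation(target: str, recognised: str, following: Optional[str]) -> bool:
    """Same manner, but the recognised phoneme has the place of the next phoneme."""
    manner = MANNER.get(target)
    if manner is None or MANNER.get(recognised) != manner or following is None:
        return False
    next_place = PLACE.get(following)
    return next_place is not None and PLACE[recognised] == next_place != PLACE[target]

def voicing_assimilation(target: str, recognised: str) -> bool:
    """Voiceless phoneme recognised as its voiced counterpart."""
    return (target, recognised) in VOICING

print(place_assimilation("n", "N", "g"))   # True  ("machine gun")
print(place_assimilation("m", "n", "z"))   # True  ("platforms")
print(voicing_assimilation("s", "z"))      # True  ("zero seven")
```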
In addition, as has already been mentioned above, some substitution errors are due to the quite legitimate variations which occur in fluent speech, and these nearly always involve a minimal difference between target and recognised phoneme, such as place of articulation or voicing. The alternation of /i/ with /I/ in final unstressed syllables, as in "facility" and "twenty", and of /@/ with practically any unstressed vowel, is well known, and was the source of on average 15% of the substitution errors. Such errors lend support to the hypothesis that a major part of the substitution errors made by the system have a phonetic explanation.

Rows give the spoken place class; entries are the percentages recognised as each class, in the column order labial, alveolar, palatal-alveolar, velar, front, central, back, fronting, backing, centring, with blank cells not shown and the within-class (diagonal) value in brackets.

Labial       [86.1]  …
Backing       2.8  0.9  7.9  2.8  [79.6]  0.5     (total 216)
Centring      8.8  1.8  [89.5]                    (total 57)

Table 5  Confusion matrix for place of articulation - all speakers


3.3  Deletions 

3.3.1  Individual  phonemes 

Deletions account for 42% of the substitution and deletion errors, so it would be useful to find out why they occur. Table 6 shows the deletion rates for individual phonemes.

Among the consonants, /v/ scores poorly, as does /b/. We have already discussed the possible reasons for the poor performance of /v/, and of weak fricatives and nasals in general, but it is less clear why a sound such as /b/ should be missed; since this is consistent across speakers, it is possible that the models are defective in some way. There also appears to be a problem with /m/ specific to Speaker 1: 28.6% of this speaker's /m/s were deleted, compared with 4.1% for both Speakers 2 and 3. There does not seem to be any particular pattern to these deletions, and at the moment there is no explanation for them, except that the models may be unreliable.
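The “All” column of Table 6 pools the three speakers by token count rather than averaging their percentages. A minimal sketch, assuming each printed percentage was computed from an integer count (so the count can be recovered by rounding), reproduces the /m/ and /I/ entries; for /I/ a simple mean of the three percentages would give 9.8 rather than 9.6, because Speaker 3 has more /I/ tokens (160, against 117 for each of the others).

```python
def pooled_rate(rates_pct, totals):
    """Combine per-speaker percentages into one pooled percentage.

    Assumes each percentage came from an integer count, so the count is
    recovered as round(rate * total / 100) before pooling.
    """
    events = [round(rate * n / 100.0) for rate, n in zip(rates_pct, totals)]
    return round(100.0 * sum(events) / sum(totals), 1)


# /m/: 49 tokens per speaker, deleted 28.6%, 4.1% and 4.1% of the time
print(pooled_rate([28.6, 4.1, 4.1], [49, 49, 49]))     # 12.2
# /I/: 117, 117 and 160 tokens
print(pooled_rate([13.7, 7.7, 8.1], [117, 117, 160]))  # 9.6
```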

3.3.2 Manner of articulation

The deletions according to manner of articulation are shown in Figure 6. There is no real pattern to the manner class deletions, although strong fricatives appear more robust than the other classes for all speakers. The large percentage of nasal deletions for Speaker 1 is due in part to the predominance of /m/ deletions already mentioned, but this speaker also has nearly twice as many /n/ deletions as the others.




Phoneme  Spkr 1  Spkr 2  Spkr 3    All  Total     Phoneme  Spkr 1  Spkr 2  Spkr 3    All  Total
s           4.4     6.6     5.8    5.6    408     i           3.4     2.3     2.1    2.7    224
z           3.5     7.0    10.5    7.0    171     I          13.7     7.7     8.1    9.6    394
S          12.9     0.0     3.2    5.4     93     E           7.2     6.0     9.6    7.6    249
f           5.1     5.1     3.8    4.7    234     {           2.0     0.0     1.7    1.2    165
v          40.0     6.7    43.3   30.0     90     A           0.0     0.0     0.0    0.0    111
T          13.9    20.0     8.3   14.0    107     Q          15.4     0.0    15.4    7.1     56
D             -     0.0       -    0.0      1     O          10.8     2.7     0.0    4.5    111
h          15.4    15.4    15.4   15.4     39     U           0.0    33.3    33.3   22.2      9
tS          0.0     0.0     0.0    0.0     27     u          12.7     1.8     3.6    6.0    165
dZ          9.1     0.0    18.2    9.1     33     3           0.0     0.0     0.0    0.0      9
p           2.4     4.9     4.9    4.1    123     @          29.3    16.0    17.3   20.9    450
b          26.7    33.3    13.3   24.2     45     V           2.9    16.7     8.6    8.0     88
t           9.1     7.8     2.6    6.5    693     eI          0.0     0.0     0.0    0.0    147
d          25.6    19.5    13.4   19.5    246     aI          7.8     0.0     0.0    2.6    153
k           7.2     5.2     4.1    5.5    291     OI          0.0     0.0     0.0    0.0      3
g           7.3     4.9     2.4    4.9    123     aU          0.0     0.0     0.0    4.1     48
m          28.6     4.1     4.1   12.2    147     @U          5.4     3.6     1.8    3.6    168
n          19.9    11.1    10.5   13.8    513     I@          0.0     0.0     0.0    0.0     51
N          16.6    27.8    16.7   20.3     54     e@          0.0     0.0     0.0    0.0      6
l          16.0     4.0    13.3   11.1    225     <at>        0.0     4.8     0.0    1.6     63
r           9.7     2.9     5.8    6.2    309     <oh>        0.0     0.0     0.0    0.0     18
w           4.5     4.5     2.3    3.8    132     <of>        0.0     0.0    18.2    6.1     33
j           0.0    14.3     0.0    4.8     42     <or>        0.0     0.0     0.0    0.0      6

Table 6 Phoneme deletions (%) for each and all speakers (Total = number of tokens)


3.3.3 Place of articulation

Figure 7 shows the deletions analysed by place of articulation. By far the most frequently deleted class is that of the central vowels, mainly because of the phoneme /@/, which accounts for 93% of all central vowel deletions. The reasons for this are often contextual, as discussed in 3.3.4 below. Diphthongs are rarely deleted, possibly because they are relatively long and usually have quite a clear structure.


Figure  7  Graph  of  place  class  deletions 


3.3.4  Contextual  effects 

A scored deletion is often the result of the system labelling two phonemes as one. The most typical examples occur when the same sound ends one word and begins the next, as in “five five” or “six six” (“five” is pronounced “fife” to help avoid confusion with “nine”, which is pronounced “niner”). In fluent speech the two phonemes tend to run into each other and the system recognises only one, so the above examples are recognised as /faIfaIf/ and /sIksIks/. This is another example of the system being penalised for an error which is due to the normal phonological processes of fluent speech. A different error of this type may be attributed to the fact that we are working with a very limited and rather specialised vocabulary. Part of the second /n/ in “niner” is often labelled as part of the /aI/. This may be because “niner” occurs frequently in the database, so it has a significant influence on the (aI:n_n) triphone model (/aI/ with /n/ as its left and right context).
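To see why two merged phonemes are scored as a deletion, it helps to look at how the recognised and target strings are aligned. The report does not spell out its scoring procedure; the sketch below assumes a standard minimum-edit-distance alignment with unit costs, and the function name is hypothetical. Aligning the target /faIf faIf/ with the recognised /faIfaIf/ gives five correct labels and one deletion.

```python
def align_counts(target, recognised):
    """Count correct, substituted, deleted and inserted labels.

    Minimum-edit-distance alignment with unit costs (an assumption; the
    exact costs used for scoring in this study are not stated).
    """
    n, m = len(target), len(recognised)
    # cost[i][j]: cheapest alignment of target[:i] with recognised[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = cost[i - 1][j - 1] + (target[i - 1] != recognised[j - 1])
            cost[i][j] = min(diag, cost[i - 1][j] + 1, cost[i][j - 1] + 1)

    # Trace back from the bottom-right corner, tallying each operation.
    counts = {"Cor": 0, "Sub": 0, "Del": 0, "Ins": 0}
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (target[i - 1] != recognised[j - 1]):
            counts["Cor" if target[i - 1] == recognised[j - 1] else "Sub"] += 1
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            counts["Del"] += 1
            i -= 1
        else:
            counts["Ins"] += 1
            j -= 1
    return counts


# "five five" spoken as /faIf faIf/ but recognised as /faIfaIf/
target = ["f", "aI", "f", "f", "aI", "f"]
recognised = ["f", "aI", "f", "aI", "f"]
print(align_counts(target, recognised))  # {'Cor': 5, 'Sub': 0, 'Del': 1, 'Ins': 0}
```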

Many of the deletion errors are caused by genuine elisions by the speaker. For example, the /@/ vowel is often elided, particularly in unstressed syllables before a nasal or liquid. The speaker-specific dictionaries account for a number of such cases, for example in “seven” (/sEvn/) and “hidden” (/hIdn/), but in the present analysis, if an /@/ appears in the dictionary transcription it is scored as a deletion when it is not recognised, even if in reality it was not produced. Words like “correction” are transcribed /k@rEkS@n/, but either (or both) of these schwas may be elided in fast speech, giving /krEkSn/. This is probably why /@/ is the most deleted phoneme, and is twice as likely to be deleted as any other vowel. Speaker 1 has almost twice as many /@/ deletions as the other two speakers (44, as opposed to 24 for Speaker 2 and 26 for Speaker 3), but 48% of these deletions may be attributed to legitimate variation in the way in which some words are pronounced. A slightly larger proportion of Speaker 2's schwa deletions can be explained in this way (58%), but a smaller proportion of Speaker 3's (38%).
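Whether a scored /@/ deletion is “legitimate” in this sense can be checked mechanically, by asking whether the recognised string is the dictionary transcription with some schwas removed. A sketch under that assumption (the function name is hypothetical, and real pronunciation variation covers more than schwa elision):

```python
def explained_by_schwa_elision(dictionary_form, recognised, schwa="@"):
    """True if `recognised` equals `dictionary_form` with zero or more schwas removed.

    Walks the dictionary transcription left to right: a phone matching the next
    recognised phone is consumed; an unmatched schwa is treated as elided; any
    other unmatched phone means the recognised string is not such a variant.
    """
    j = 0
    for phone in dictionary_form:
        if j < len(recognised) and phone == recognised[j]:
            j += 1
        elif phone != schwa:
            return False
    return j == len(recognised)


correction = ["k", "@", "r", "E", "k", "S", "@", "n"]
print(explained_by_schwa_elision(correction, ["k", "r", "E", "k", "S", "n"]))       # True
print(explained_by_schwa_elision(correction, ["k", "r", "E", "k", "S", "@", "n"]))  # True
print(explained_by_schwa_elision(correction, ["k", "r", "E", "t", "S", "n"]))       # False
```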

Another case where deletion is predictable is that of word-final stops, which are frequently omitted, particularly in fast speech before a word-initial stop (e.g. “target category” is realised as /tAgI k{t@gri/). There are many errors of this type, and they are analysed in more detail in [1].

Deletions of /h/ are yet another example of phonologically predictable errors. This phoneme can be very variable, as it tends to take on the spectral structure of the following vowel, and is often indistinct from it. In addition, one third of the /h/ deletions occurred after a voiceless plosive, e.g. “stroke hotel”, which was recognised as /str@Uk @UtEl/; here it is likely that the /h/ has merged with the aspiration of the /k/, causing the system to miss it. On average, 15% of all deletions can be explained by phonological effects.


3.4  Insertions 


3.4.1  Individual  phonemes 

Insertions occur when the system has put in an extra phoneme label; the numbers of insertions for each and all speakers are presented in Table 7. Some of the highly inserted phonemes are speaker specific: /d/ for Speaker 1 is an interesting example for which there is no explanation. Others are common to all speakers, and here /@/ is the clearest example. As already mentioned, the speaker-specific dictionaries account for a number of predictable cases of elision of this phoneme, but there are occasions when the speaker does pronounce /@/, for instance producing not a syllabic /n/ but /@n/. On average, 45% of /@/ insertions could be accounted for in this way.
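The columns of the results tables in Appendix C follow from four counts per phoneme. The sketch assumes the usual definitions %Cor = 100C/N and %Acc = 100(N - S - D - I)/N, and uses counts back-calculated from the Speaker 1 figures for /s/ and /d/; it shows how Speaker 1's 33 /d/ insertions pull accuracy down to 19.5% even though 59.8% of the /d/ tokens are correct.

```python
def phone_scores(n_ref, n_cor, n_sub, n_del, n_ins):
    """Percent correct, substituted, deleted and accuracy for one phoneme.

    Accuracy also penalises insertions: %Acc = 100 * (N - S - D - I) / N,
    which equals 100 * (C - I) / N because N = C + S + D.
    """
    if n_cor + n_sub + n_del != n_ref:
        raise ValueError("correct + substituted + deleted must equal the reference count")

    def pct(x):
        return round(100.0 * x / n_ref, 1)

    return {"Cor": pct(n_cor), "Sub": pct(n_sub), "Del": pct(n_del),
            "Acc": pct(n_cor - n_ins)}


print(phone_scores(136, 114, 16, 6, 11))  # /s/, Speaker 1
print(phone_scores(82, 49, 12, 21, 33))   # /d/, Speaker 1
```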

As an interesting aside, all six of Speaker 1's /l/ insertions followed the phoneme /O/. This contrasts with four out of seven for Speaker 3 and one out of four for Speaker 2. This is interesting because, for many speakers, the so-called “dark /l/” resembles an /O/ vowel in spectral structure (but with a slightly lower intensity), and it could be that an off-glide of /O/ is confused with /l/.

The phoneme /h/ was inserted only twice, both times in Speaker 3's data and both after a voiceless plosive (e.g. “time” recognised as /thaIm/), and we can hypothesise that the aspiration of the /t/ was what caused the insertion of /h/.
Phoneme  Spkr 1  Spkr 2  Spkr 3  Total     Phoneme  Spkr 1  Spkr 2  Spkr 3  Total
s            11       8       4     23     i             6       3       0      9
z             9      12       3     24     I            16       1      15     32
S             0       0       0      0     E             2       1       4      7
f             3       5       2     10     {             1       3       0      4
v             5       4       4     13     A             0       1       1      2
T             4       4       2     10     Q             0       0       0      0
D             0       0       0      0     O             1       0       0      1
h             0       0       2      2     U             0       1       0      1
tS            0       0       0      0     u             3       1       1      5
dZ            0       2       0      2     3             1       0       0      1
p            14      12      10     36     @             9      11      20     40
b             2       0       4      6     V             2       4       2      8
t             8       5      10     23     eI            0       0       0      0
d            33       9       9     51     aI            0       2       0      2
k             5       4       1     10     OI            0       0       0      0
g             0       0       0      0     aU            0       0       1      1
m             2       7       3     12     @U            3       1       0      4
n             1      11       2     14     I@            0       1       0      1
N             0       1       0      1     e@            0       0       0      0
l             6       4       7     17     <at>          0       0       0      0
r             4       0       3      7     <oh>          0       0       0      0
w             0       2       2      4     <of>          0       0       0      0
j             2       0       0      2     <or>          0       0       0      0

Table 7 Phoneme insertions for each and all speakers


3.4.2  Manner  of  articulation 

We  have  already  mentioned  that  consonants  are  more  than  twice  as  likely  to  get  inserted  as  vowels.  The 
comparatively  high  level  of  consonant  insertion  was  common  to  Speaker  2  (90  consonants  compared  with  30 
vowels)  and  Speaker  1  (109  consonants  and  44  vowels),  but  not  so  conspicuous  in  Speaker  3’s  results.  The 
distribution  of  insertions  with  respect  to  where  they  occur  is  interesting.  As  many  as  65%  of  consonant 
insertions  were  between  words,  perhaps  being  confused  with  breath  noise  or  lip  smacks;  while  69%  of  vowels 
insertions  were  within  words.  Figure  8  shows  the  distribution  of  the  rather  finer  manner  class  insertions. 


Plosives are the most frequently inserted class, and 83% of these insertions were between words, possibly bearing out the hypothesis that, although there are explicit models for breath noise, lip smacks and other glitches, these sounds are nevertheless being recognised as plosives. The only consonant class more often inserted within words than between them (77% within words) is the strong fricatives, and this may be for the same reason as vowel insertion, for which see 3.4.4 below.
3.4.3 Place of articulation


The distribution of insertions according to place of articulation is given in Figure 9. The alveolars were the most likely to be inserted (though the prevalence of /d/ insertions may account for this). Diphthongs were very rarely inserted, as were the velars and palato-alveolars, though there are too few of the latter to allow any conclusion to be drawn.


Figure  9  Graph  of  place  class  insertions 


3.4.4 Contextual effects

A summary of the contexts of insertions can be found in [1]. It often happens that a long phoneme is recognised as two separate phonemes. Sometimes these phonemes are identical, as when /@U/ (in “zero”, for instance) is transcribed as /@U @U/; in other cases the insertion is phonetically related, as when “many” is recognised as /mEniI/; or diphthongs may be recognised as two vowels, so that “sight” is recognised as /elil/. Offglides from vowels are often recognised as vowel + /@/, e.g. /O/ in “four” as /O@/, and /@U/ in “zero” as /@U @/. Repetitions of identical phoneme labels, split diphthongs and offglide schwas together account for 80% of the vowel insertions (26%, 21% and 33% respectively).
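The three categories of vowel insertion can be approximated by looking only at the recognised label just before the insertion and the target phoneme it was aligned with. The rules and names below are a hypothetical heuristic sketch, not the procedure used to obtain the 26%, 21% and 33% figures; the vowel and diphthong sets use the SAMPA symbols found in this report.

```python
VOWELS = {"i", "I", "E", "{", "A", "Q", "O", "U", "u", "V", "3", "@",
          "eI", "aI", "OI", "@U", "aU", "I@", "e@", "U@"}
DIPHTHONGS = {"eI", "aI", "OI", "@U", "aU", "I@", "e@", "U@"}


def classify_vowel_insertion(prev_recognised, inserted, prev_target):
    """Rough category for an inserted vowel label (heuristic sketch).

    prev_recognised: recognised label immediately before the insertion
    inserted:        the inserted vowel label
    prev_target:     target phoneme aligned with prev_recognised
    """
    if inserted == prev_recognised:
        return "repetition"          # e.g. /@U/ recognised as /@U @U/
    if inserted == "@" and prev_recognised in VOWELS:
        return "offglide schwa"      # e.g. /O/ recognised as /O @/
    if prev_target in DIPHTHONGS and prev_recognised != prev_target and inserted in VOWELS:
        return "split diphthong"     # a diphthong recognised as two vowels
    return "other"


print(classify_vowel_insertion("@U", "@U", "@U"))  # repetition
print(classify_vowel_insertion("O", "@", "O"))     # offglide schwa
print(classify_vowel_insertion("{", "I", "aI"))    # split diphthong
print(classify_vowel_insertion("i", "I", "i"))     # other
```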

With the consonants, the reasons for insertions are not so clear. Some of the insertions, like those of the vowels, are due to two identical labels being assigned to one phoneme (13%); others (9%) are phonetically related, as when /s/ following a voiced sound (and usually word-initial) is transcribed as /z s/. For example, “four six” was recognised as /fO z sIks/.

In addition, around 4% of all consonant insertions are due to the speaker's insertion of certain sounds (mainly glides) as linkers to ease the transition between sounds. Examples of this are the insertion of /r/ in “4/8” (/fOreItTs/) and between “niner” and “oh”. A linking /w/ is inserted in “2/8” (/tuweItTs/) and between “tango” and “eight” (/t{Ng@UweIt/), and a linking /j/ in “virtually unusable” (/v3tS@li j @n.../). The numbers of such insertions are small in our test data, but this probably reflects the limited occasions on which such links could occur. For instance, /w/ was inserted only twice each by Speaker 2 and Speaker 3, and not at all by Speaker 1, but on each of those occasions it could be classed as a linking sound. Similarly, /j/ was inserted only twice by Speaker 1 and not at all by the others, and one of those insertions was a linking one. Three out of four of Speaker 1's, and two out of three of Speaker 3's, /r/ insertions were linking cases. The fact that the system inserts the appropriate labels in these cases may indicate that we need to model triphones across word boundaries, rather than just word-internally as we do at the moment.
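The difference between word-internal and cross-word triphones can be shown with the (centre:left_right) notation used above for (aI:n_n). In this sketch, which is an assumption about labelling rather than a description of the ARM models, word-internal expansion marks a word edge with a hypothetical boundary symbol `#`, whereas cross-word expansion lets the /@U/ at the end of “tango” see the /eI/ of “eight”, the context in which the linking /w/ was inserted.

```python
def _triphones(phones, boundary):
    """Label each phone with its left and right neighbours within `phones`."""
    labels = []
    for k, centre in enumerate(phones):
        left = phones[k - 1] if k > 0 else boundary
        right = phones[k + 1] if k + 1 < len(phones) else boundary
        labels.append(f"{centre}:{left}_{right}")
    return labels


def triphone_labels(words, cross_word=False, boundary="#"):
    """Triphone labels 'centre:left_right' for a sequence of words.

    cross_word=False: context stops at each word edge (shown as `boundary`).
    cross_word=True:  edge phones take their context from the neighbouring word.
    """
    if cross_word:
        return _triphones([p for word in words for p in word], boundary)
    return [label for word in words for label in _triphones(word, boundary)]


print(triphone_labels([["n", "aI", "n", "@"]])[1])       # aI:n_n
tango_eight = [["t", "{", "N", "g", "@U"], ["eI", "t"]]
print(triphone_labels(tango_eight)[4])                   # @U:g_#
print(triphone_labels(tango_eight, cross_word=True)[4])  # @U:g_eI
```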
4. Discussion and conclusions


There are many interesting observations to be made from these data. What has been presented here is an attempt to pull them together and point out general trends which might indicate what the phoneme models are doing right, as well as what they are doing wrong.

From this short discussion, two types of error have emerged: those which are genuine misrecognitions, and those which are due to the normal co-articulatory effects of fluent speech, and are thus to be expected. As far as the former are concerned, phonological effects appear to be involved in around 30% of such errors.

The vast majority of genuine errors are not unexpected, involving as they do confusions between rather similar phonemes, or deletions of acoustically weak segments. Weak sounds such as nasals or weak fricatives predictably cause problems, as does the neutral /@/. Equally, strong and long sounds such as strong fricatives and diphthongs are handled well. The surprisingly good recognition of liquids and glides may provide an independent vindication of the use of variable frame rate analysis. A large number of the insertions and deletions could probably be prevented if our duration modelling were more sophisticated.

Although the majority of the errors appear to have a phonetic basis, there are cases where the errors are as yet inexplicable from a phonetic point of view: for example, the unusually large number of /d/ insertions in Speaker 1's data, and the poor recognition of the same speaker's /m/, Speaker 3's /b/ and Speaker 2's /V/. A small number of phonemes (and Speaker 2's /V/ may belong to this group) may simply not occur frequently enough for a reliable indication of performance to be made. Where there is no phonetic explanation of an error, it would be interesting to find out whether the system's own measure of its goodness of match is consistent with our judgement of its performance.

It is important to remember that this study was based on a system which used no dictionary, although the triphones are forced to match at the edges. When lexical and syntactic constraints are available, as they are when the system is run in its usual mode as a word recogniser, many of the problems discussed above no longer occur. However, a general improvement in the sub-word level modelling would provide a sound basis for better word recognition. This study has enabled us to pinpoint a few areas where our models might be improved, and may indicate that we need to give some consideration to phonological effects across word boundaries. The level of performance ultimately depends on the task and vocabulary, and a next step might be to assess the extent to which the somewhat specialised vocabulary of the ARM task has influenced these results, by looking at other tasks and bigger vocabularies, as well as at a wider range of speakers.
Appendix A. Phoneme frequency


Rank  Frequency [3]  Phone  Example      ARM frequency  ARM rank
1     10.74          @      alpha        6.67           3
2     8.33           I      civil        5.80           5
3     7.58           n      new          7.60           2
4     6.42           t      target       10.27          1
5     5.14           d      damaged      3.64           10
6     4.81           s      six          6.04           4
7     3.66           l      lima         3.33           12
8     3.56           D      then         0.01           42
9     3.51           r      rail         4.58           6
10    3.22           m      map          2.18           =18
11    3.09           k      correction   4.31           7
12    2.97           E      enemy        3.67           9
13    2.81           w      well         1.96           20
14    2.46           z      comprising   2.53           13
15    2.00           v      over         1.33           27
16    1.97           b      about        0.67           33
17    1.83           aI     dimension    2.27           17
18    1.79           f      file         3.46           11
19    1.78           p      papa         1.82           =21
20    1.75           V      up           1.30           28
21    1.71           eI     containing   2.18           =18
22    1.65           i      beacon       3.91           8
23    1.51           @U     close        2.49           14
24    1.46           h      hotel        0.58           =35
25    1.45           {      damaged      2.44           =15
26    1.37           Q      approx       0.83           29
27    1.24           O      four         1.64           =23
28    1.15           N      bearing      0.80           30
29    1.13           u      two          2.44           =15
30    1.05           g      grass        1.82           =21
31    0.96           S      ambush       1.38           26
32    0.88           j      yards        0.62           34
33    0.86           U      woods        0.13           =38
34    0.79           A      Charlie      1.64           =23
35    0.61           aU     south        0.71           32
36    0.60           dZ     damaged      0.49           36
37    0.52           3      heard        0.13           =38
38    0.41           tS     Charlie      0.40           37
39    0.37           T      three        1.58           25
40    0.34           e@     air          0.08           40
41    0.21           I@     clear        0.76           31
42    0.14           OI     destroyed    0.04           41
43    0.10           Z      pleasure     0.00           =43
44    0.06           U@     poor         0.00           =43
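The ARM rank column uses tied ranks (marked “=”) with the following rank skipped, so the two phones at =18 are followed by rank 20. A sketch of that ranking convention, assuming frequencies are compared exactly as printed; the input is a four-phone excerpt from the table (aI, m, eI, w), which in the full list sit at 17, =18, =18 and 20.

```python
from collections import Counter


def competition_ranks(freqs):
    """Rank items by descending frequency; tied items share a rank marked '='.

    'Standard competition' ranking: an item's rank is one more than the number
    of items with a strictly higher frequency, so ranks are skipped after a tie.
    """
    values = list(freqs.values())
    ties = Counter(values)
    ranks = {}
    for item, f in freqs.items():
        rank = 1 + sum(1 for v in values if v > f)
        ranks[item] = f"={rank}" if ties[f] > 1 else str(rank)
    return ranks


excerpt = {"aI": 2.27, "m": 2.18, "eI": 2.18, "w": 1.96}
print(competition_ranks(excerpt))  # {'aI': '1', 'm': '=2', 'eI': '=2', 'w': '4'}
```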
Appendix B. The ARM vocabulary


about 

access 

acquisition 

active 

activity 

aerials 

aircraft 

air-raid 

alpha 

ammo 

and 

antennae 

anti-armour 


armoured 

array 

as 

assembly 

badger 

beacon 

bearing 

below 

bomber 

bravo 

brow 

camouflaged 

canal 

carriage 

capability 

casevac 

category 

Charlie 

Civil 

clear 

comms 

column 

complete 

comprising 

concrete 

considerable 

construction 

convoy 

co-ords

covert 

damaged 

data 

defence 

defended 

delta 

diameter 

dimension 

dipoles 

direction 

dispersal 

destroyed 

dump 

easy 

echo 

eighteen 

eleven 

end 


@baUt
{ksEs
{kwIzIS@n
{ktIv
{ktIvIti
e@rI@lz
e@krAft
e@reId
{lf@
{m@U
{nd
{ntEnaI
{ntiAm@
@prQks
eIpiviz
Am@d
@reI
{z
@sEmbli
b{dZ@
bik@n
be@rIN
bIl@U
bQm@
brAv@U
braU
k{m@flAZd
k@n{l
k{rIdZ
keIp@bIlIti
k{z@v{k
k{t@gri
tSAli
sIv@l
klI@
kQmz
kQl@m
k@mplit
k@mpraIzIN
kQnkrit
k@nsIdr@bl
k@nstrVkS@n
kQnvOI
k@UOdz
k@Uv3t
d{mIdZd
deIt@
dIfEns
dIfEndId
dElt@
daI{mIt@
daImEnS@n
daIp@Ulz
dIrEkS@n
dIsp3s@l
dIstrOId
dVmp
izi
Ek@U
eItin
IlEv@n
End


above 

ack-ack 

action 

activities 

aerial 

AEW 

airfield 

airstrip 

ambush 

ammunition 

antenna 

anti-aircraft 

anti-tank 

approximately 

armour 

arms 

artillery 

assembled 

associated 

be 

bear 

being 

blocked 

bowsers 

bridge 

C2 

camp 

cantilever 

carriages 

capacity 

cat 

centre 

circular 

civilian 

close 

collection 

communications 

comprehensive 

concealed 

conical 

consisting 

containing 

co-ordinates 

correction 

crossing 

dash 

deck 

defences 

degrees 

depot 

difficult 

dipole 

direct 

dish 

dispersals 

dug 

each 

east 

eight 

eighty 

emplacement 

enemy 


@bVv
{k{k
{kS@n
{ktIvItiz
eIidVb@lju
e@strIp
{mbUS
{mjunIS@n
{ntEn@
{ntie@krAft
{ntit{Nk
@prQksIm@tli
Am@
Amz
AtIl@ri
@sEmbld
@s@UsieIt@d
bi
be@
biIN
blQkt
baUz@z
brIdZ
situ
k{mp
k{ntIliv@
k{rIdZIz
k@p{sIti
k{t
sEnt@
s3kjul@
s@vIlj@n
kl@Us
k@lEkS@n
k@mjunIkeIS@nz
kQmprIhEnsIv
k@nsild
kQnIk@l
k@nsIstIN
k@nteInIN
k@UOdIn@ts
k@rEkS@n
krQsIN
d{S
dEk
dIfEnsIz
d@griz
dEp@U
dIfIk@lt
daIp@Ul
dIrEkt
dIS
dIsp3s@lz
dVg
itS
ist
eIt
eIti
EmpleIsm@nt
En@mi


engineering 

evidence 

EW 

feet 

field 

fifty 

fire 

fishbed 

fixed 

flogger 

fortified 

four 

foxbat 

freight 

fuel 

GMT 

going 

grass 

ground 

gun 

hangar 

having 

heading 

height 

helos 

hind 

hocum 

horizontal 

hospital 

hour 

hundred 

hyphen 

incomplete 

india 

inoperative 

intact 

joint 

junction 

kilometres 

knots 

launch 

launchers 

less 

lift 

like 

limited 

little 

located 

logistics 

lorry 

machine-gun 

maintenance 

many 

marshalling 

material 

mechanised 

message 

mike 

miles/hour 

minor 

missiles 


EndZInI@rIN
iEsEm
EvId@ns
idVblju
fit
fi@ld
fIfti
faI@
fISbEd
fIkst
flQg@
fOtIfaId
fO
fQksb{t
freIt
fju@l
dZiEmti
g@UIN
grAs
graUnd
gVn
h{N@
h{vIN
hEdIN
haIt
hEl@Uz
haInd
h@Uk@m
hQrIzQnt@l
hQspIt@l
aU@
hVndrEd
haIf@n
Ink@mplit
Indj@
InQpr@tIv
Int{kt
dZOInt
dZVnkS@n
kIlQm@t@z
nQts
lOntS
lOntS@z
lEs
lIft
laIk
lIm@tId
lItl
l@UkeItId
lQdZIstIks
lQri
m@SingVn
meInt@n@ns
mEni
mAS@lIN
m@tI@rI@l
mEk@naIzd
mEsIdZ
maIk
maIlzp3aU@
maIn@
mIsaIlz


erect 

estimated 

evident 

facility 

a_few 

fifteen 

fighter 

firing 

five 

flanker 

forger 

forty 

fourteen 

foxtrot 

frenzied 

fulcrum 

golf 

goods 

grid 

guidance 

guns 

hardened 

havok 

heavy 

helicopters 

hidden 

hip 

holding 

horn

hotel 

howitzer 

hurried 

including 

incorporating 

infantry 

installation 

intelligence 

juliet 

kilo 

kilometres/hour 

lanes 

launcher 

length 

level 

light 

lima 

lines 

loading 

location 

loop 

machine 

main 

major 

map 

mast 

MCVs 

medium 

metres 

miles 

military 

missile 

mixed 


IrEkt
EstImeItId
EvId@nt
f@sIlIti
@fju
fIftin
faIt@
faI@rIN
faIf
fl{Nk@
fOdZ@
fOti
fOtin
fQkstrQt
frEnzid
fUlkr@m
gQlf
gUdz
grId
gaId@ns
gVnz
hAd@nd
h{v@k
hEvi
hElIkQpt@z
hIdn
hIp
h@UldIN
hOn
h@UtEl
haUwIts@
hVrid
InkludIN
InkOp@reItIN
Inf@ntri
Inst@leIS@n
IntElIdZ@ns
dZuliEt
kil@U
kIlQm@t@zp3aU@
leInz
lOntS@
lENT
lEv@l
laIt
lim@
laInz
l@UdIN
l@UkeIS@n
lup
m@Sin
meIn
meIdZ@
m{p
mAst
Emsiviz
midj@m
mit@z
maIlz
mIl@tri
mIsaIl
mIkst


mobile
m@UbaIl
modified
more
mO
most
motorised
m@Ut@raIzd
motorway
movement
muvm@nt
much
navaid
n{veId
near
new
nju
nine
nineteen
naIntin
ninety
no
n@U
normal
north
nOT
northeast
northwest
nOTwEst
not
noticed
n@UtIst
november
number
nVmb@
a-number
numerous
obstructed
njum@r@s
QbstrVktId
observed
occupied
on
Qn
one
operational
Qp@reIS@n@l
oscar
out
aUt
over
pack
parabolic
p{k
p{r@bQlIk
papa
partially
partly
pAtli
passenger
peak
pik
per
perhaps
p@h{ps
permanent
personnel
p3s@nEl
pipeline
platforms
pl{tfOmz
platoon
plus
plVs
police
pontoon
pQntun
position
possibly
pQsIbli
practically
prepared
proceeding
prIpe@d
pr@sidIN
principal
projectiles
protected
pr@tEktId
quebec
radar
reIdA
radio
rail
reIl
railway
ready
rEdi
re-arming
recce
rEki
receiver
reconnaissance
r@kQnIs@ns
red-cross
ref
rEf
reference
re-fuelling
rifju@lIN
refurbishing
repair
r@pe@
repaired
repeat
r@pit
report
rhombic
rQmbIk
river
road
r@Ud
rocket
rockets
rQkIts
romeo
rotary
r@Ut@ri
roughly
rounds
raUndz
runway
runways
SAM-7
rVnweIz
s{msEvn
SAM
scientific
scout
skaUt
section
sections
sEkS@nz
self-propelled
semi
sEmi
serviceable
sets
sEts
seven
seventeen
sEvntin
seventy
several
sEvr@l
siding
sidings
saIdINz
sierra
sighted
saItId
sighting
similar
sImIl@
single
site
saIt
six
sixteen
sIkstin
sixty
size
saIz
skip
slash
sl{S
small
so
s@U
some
something
sVmTIN
south
southeast
saUTist
southwest
mQdIfaId
m@Ust
m@Ut@weI
mVtS
nI@
naIn@
naInti
nOm@l
nOTist
nQt
n@UvEmb@
@nVmb@
Qbz3vd
QkjupaId
wVn
Qsk@
@Uv@
P(P(
pAS@li
p{s@ndZ@
p3
p3m@n@nt
paIplaIn
pl@tun
p@lis
p@zIS@n
pr{ktIkli
prInsIp@l
pr@dZEktaIlz
kw@bEk
reIdj@U
reIlweI
riAmIN
r@siv@
rEdkrQs
rEfr@ns
rif3bISIN
r@pe@d
r@pOt
rIv@
rQkIt
r@Umi@U
rVfli
rVnweI
s{m
saI@ntIfIk
sEkS@n
sElfpr@pEld
s3vIs@bl
sEvn
sEv@nti
saIdIN
siEr@
saItIN
sINgl
sIks
sIksti
skIp
smOl
sVm
saUT
saUTwEst


span
speed
sp{n
spid
spans
SPGs
sp{nz
EspidZiz
spotted
spQtId
squad
skwQd
squadron
skwQdr@n
static
st{tIk
station
steIS@n
steel
sti@l
stone
st@Un
stop
stQp
storage
stOrIdZ
stores
stOz
strip
strIp
stroke
str@Uk
structural
strVktS@r@l
summit
sVmIt
supply
s@plaI
surface
s3fIs
support
swing
s@pOt
swIN
suspension
tactical
s@spEnS@n
t{ktIk@l
tango
tankers
t{Ng@U
t{Nk@z
tanker
tanks
t{Nk@
t{Nks
target
tAgIt
TARWI
tAwi
task
tAsk
taxiway
t{ksIweI
taxiways
t{ksIweIz
temporarily
tEmp@re@rIli
temporary
tEmp@ri
ten
tEn
than
D@n
thirteen
T3tin
thirty
T3ti
thousand
TaUz@nd
three
Tri
time
taIm
to
tu
total
t@Ut@l
tracked
tr{kt
tracks
tr{ks
train
treIn
trains
treInz
transmitter
trAnzmIt@
transport
trAnspOt
trees
triz
troop
trup
troops
trups
twelve
twElv
twenty
twEnti
twenty-one
twEntiwVn
twenty-three
twEntiTri
twenty-two
twEntitu
twin
twIn
two
tu
type
taIp
undamaged
Vnd{mIdZd
undefended
VndIfEndId
under
Vnd@
unidentified
VnaIdEntIfaId
unipole
junIp@Ul
unknown
Vn@Un
unloading
Vnl@UdIN
unobstructed
Vn@bstrVktId
unoccupied
VnQkjupaId
unoperational
VnQp@reIS@n@l
unrepaired
Vnr@pe@d
unserviceable
Vns3vIs@bl
unusable
Vnjuz@bl
uniform
junIfOm
up
Vp
U/S
juEs
usable
juz@bl
use
jus
vehicle
vi@kl
vehicles
vi@klz
vertical
v3tIkl
victor
vIkt@
virtually
v3tS@li
visible
vIz@bl
VSTOL
vIstQl
wagon
w{g@n
wagons
w{g@nz
water
wOt@
weapon
wEp@n
weapons
wEp@nz
well
wEl
west
wEst
whiskey
wIski
wholly
h@Uli
wing
wIN
wire
waI@
width
wIdT
with
wIT
wood
wUd
wooden
wUd@n
woods
wUdz
work
w3k
worked
w3kt
x-ray
EksreI
YAGI
jAgi
yankee
j{Nki
yard
jAd
yards
jAdz
zero
zI@r@U
zulu
zulu
Appendix C. Results in full


Phoneme  Total  % Cor  % Sub  % Del  No. of Ins  % Acc     Phoneme  Total  % Cor  % Sub  % Del  No. of Ins  % Acc
s          136   83.8   11.8    4.4          11   75.7     i           88   84.1   12.5    3.4           6   77.3
z           57   86.0   10.5    3.5           9   70.2     I          117   67.5   18.8   13.7          16   53.8
S           31   83.9    3.2   12.9           0   83.9     E           83   74.7   18.1    7.2           2   72.3
f           78   89.8    5.1    5.1           3   85.9     {           51   90.2    7.8    2.0           1   88.2
v           30   36.7   23.3   40.0           5   20.0     A           41   97.6    2.4    0.0           0   97.6
T           36   83.3    2.8   13.9           4   72.2     Q           13   76.9    7.7   15.4           0   76.9
D            0      -      -      -           0      -     O           37   81.1    8.1   10.8           1   78.4
h           13   61.5   23.1   15.4           0   61.5     U            3  100.0    0.0    0.0           0  100.0
tS           9   66.7   33.3    0.0           0   66.7     u           55   72.7   14.6   12.7           3   67.3
dZ          11   72.7   18.2    9.1           0   72.7     3            3  100.0    0.0    0.0           1   66.7
p           41   92.7    4.9    2.4          14   58.5     @          150   56.0   14.7   29.3           9   50.0
b           15   60.0   13.3   26.7           2   46.7     V           35   82.8   14.3    2.9           2   77.1
t          231   85.3    5.6    9.1           8   81.8     eI          49   89.8   10.2    0.0           0   89.8
d           82   59.8   14.6   25.6          33   19.5     aI          51   86.3    5.9    7.8           0   86.3
k           97   89.7    3.1    7.2           5   84.5     OI           1  100.0    0.0    0.0           0  100.0
g           41   75.6   17.1    7.3           0   75.6     aU          16   87.5   12.5    0.0           0   87.5
m           49   38.8   32.6   28.6           2   34.7     @U          56   71.4   23.2    5.4           3   66.1
n          171   55.0   25.1   19.9           1   54.4     I@          17   94.1    5.9    0.0           0   94.1
N           18   55.6   27.8   16.6           0   55.6     e@           2   50.0   50.0    0.0           0   50.0
l           75   78.7    5.3   16.0           6   70.7     <at>        21   19.0   81.0    0.0           0   19.0
r          103   88.4    1.9    9.7           4   84.5     <oh>         6   50.0   50.0    0.0           0   50.0
w           44   79.6   15.9    4.5           0   79.6     <of>        11   45.5   54.5    0.0           0   45.5
j           14  100.0    0.0    0.0           2   85.7     <or>         2   50.0   50.0    0.0           0   50.0

Table C1. Phoneme recognition results for Speaker 1.


Phoneme Total   %Cor   %Sub   %Del   Ins   %Acc
s         136   86.8    6.6    6.6     8   80.9
z          57   71.9   21.1    7.0    12   50.8
S          31  100.0    0.0    0.0     0  100.0
f          78   89.8    5.1    5.1     5   83.3
v          30   83.3   10.0    6.7     4   70.0
T          35   65.7   14.3   20.0     4   54.3
D           1    0.0  100.0    0.0     0    0.0
h          13   76.9    7.7   15.4     0   76.9
tS          9   88.9   11.1    0.0     0   88.9
dZ         11   81.8   18.2    0.0     2   63.6
p          41   90.2    4.9    4.9    12   61.0
b          15   60.0    6.7   33.3     0   60.0
t         231   85.3    6.9    7.8     5   83.1
d          82   63.4   17.1   19.5     9   52.4
k          97   92.7    2.1    5.2     4   88.7
g          41   78.0   17.1    4.9     0   78.0
m          49   85.7   10.2    4.1     7   71.4
n         171   80.1    8.8   11.1    11   73.7
N          18   50.0   22.2   27.8     1   44.4
l          75   86.7    9.3    4.0     4   81.3
r         103   96.1    1.0    2.9     0   96.1
w          44   86.4    9.1    4.5     2   81.8
j          14   71.4   14.3   14.3     0   71.4
i          88   84.1   13.6    2.3     3   80.6
I         117   70.9   21.4    7.7     1   70.1
E          83   84.3    9.7    6.0     1   83.1
{          57   89.5   10.5    0.0     3   84.2
A          35  100.0    0.0    0.0     1   97.1
Q          30   96.7    3.3    0.0     0   96.7
O          37   94.6    2.7    2.7     0   94.6
U           3   66.7    0.0   33.3     1   33.3
u          55   83.6   14.6    1.8     1   81.8
3           3  100.0    0.0    0.0     0  100.0
@         150   66.0   18.0   16.0    11   58.7
V          18   50.0   33.3   16.7     4   27.7
eI         49  100.0    0.0    0.0     0  100.0
aI         51   94.1    5.9    0.0     2   90.2
oI          1  100.0    0.0    0.0     0  100.0
aU         16   87.5   12.5    0.0     0   87.5
@U         56   82.1   14.3    3.6     1   80.4
I@         17  100.0    0.0    0.0     1   94.1
e@          2   50.0   50.0    0.0     0   50.0
<it>       21   38.1   57.1    4.8     0   38.1
<oh>        6   33.3   66.7    0.0     0   33.3
<of>       11   27.3   72.7    0.0     0   27.3
<or>        2   50.0   50.0    0.0     0   50.0

Table C2. Phoneme recognition results for Speaker 2.
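The columns of Tables C1 to C4 appear to be related by simple arithmetic: %Cor, %Sub and %Del are counts divided by the phoneme total, while %Acc also subtracts the insertions, i.e. %Acc = 100 x (correct - inserted) / total. The sketch below assumes that reading of the columns. The class and function names are illustrative, and because the memorandum prints only percentages, the counts used (for /p/, Speaker 2: 37 correct, 2 substituted and 2 deleted out of 41, with 12 insertions) are inferred from the printed values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhonemeCounts:
    """Alignment counts for one phoneme (hypothetical: the tables print only percentages)."""

    total: int        # reference tokens of the phoneme
    correct: int      # tokens recognised as the same phoneme
    substituted: int  # tokens recognised as a different phoneme
    deleted: int      # tokens with no counterpart in the recogniser output
    inserted: int     # extra tokens of the phoneme produced by the recogniser

    def __post_init__(self) -> None:
        # Each reference token is exactly one of correct, substituted or deleted.
        if self.correct + self.substituted + self.deleted != self.total:
            raise ValueError("correct + substituted + deleted must equal total")


def table_row(c: PhonemeCounts) -> dict[str, float]:
    """%Cor, %Sub, %Del, Ins and %Acc as laid out in Tables C1 to C4."""

    def pct(n: int) -> float:
        return round(100.0 * n / c.total, 1)

    return {
        "Cor": pct(c.correct),
        "Sub": pct(c.substituted),
        "Del": pct(c.deleted),
        "Ins": c.inserted,
        # Accuracy also penalises insertions: (correct - inserted) / total.
        "Acc": pct(c.correct - c.inserted),
    }


if __name__ == "__main__":
    # /p/, Speaker 2: 41 tokens; counts inferred from the printed percentages.
    print(table_row(PhonemeCounts(total=41, correct=37, substituted=2, deleted=2, inserted=12)))
```

This reproduces the /p/ row of Table C2 (90.2, 4.9, 4.9, 12, 61.0). Individual printed values occasionally differ from this rounding in the last digit.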
Appendix  C.  Results  in  full


Phoneme Total   %Cor   %Sub   %Del   Ins   %Acc
s         136   86.8    7.4    5.8     4   83.8
z          57   82.5    7.0   10.5     3   77.2
S          31   96.8    0.0    3.2     0   96.8
f          78   83.4   12.8    3.8     2   80.8
v          30   50.0    6.7   43.3     4   36.7
T          36   75.0   16.7    8.3     2   69.4
D           0      -      -      -     -      -
h          13   61.5   23.1   15.4     2   46.2
tS          9   88.9   11.1    0.0     0   88.9
dZ         11   63.6   18.2   18.2     0   63.6
p          41   82.9   12.2    4.9    10   58.5
b          15   53.4   33.3   13.3     4   26.7
t         231   87.0   10.4    2.6    10   82.7
d          82   70.7   15.9   13.4     9   59.8
k          97   90.7    5.2    4.1     1   89.7
g          41   87.8    9.8    2.4     0   87.8
m          49   91.8    4.1    4.1     3   85.7
n         171   77.2   12.3   10.5     2   76.0
N          18   50.0   33.3   16.7     0   50.0
l          75   76.0   10.7   13.3     7   66.7
r         103   88.4    5.8    5.8     3   85.4
w          44   90.9    6.8    2.3     2   86.4
j          14   85.7   14.3    0.0     0   85.7
i          48   93.7    4.2    2.1     0   93.7
I         160   76.3   15.6    8.1    15   66.9
E          83   85.6    4.8    9.6     4   80.7
{          57   86.0   12.3    1.7     0   86.0
A          35   94.3    5.7    0.0     1   91.4
Q          13   84.6    0.0   15.4     0   84.6
O          37  100.0    0.0    0.0     0  100.0
U           3   66.7    0.0   33.3     0   66.7
u          55   69.1   27.3    3.6     1   67.3
3           3  100.0    0.0    0.0     1   66.7
@         150   64.7   18.0   17.3    20   51.3
V          35   74.3   17.1    8.6     2   68.6
eI         49   91.8    8.2    0.0     0   91.8
aI         51   94.1    5.9    0.0     0   94.1
oI          1  100.0    0.0    0.0     0  100.0
aU         16   93.8    6.2    0.0     1   87.5
@U         56   76.8   21.4    1.8     0   76.8
I@         17   82.4   17.6    0.0     0   82.4
e@          2   50.0   50.0    0.0     0   50.0
<it>       21   42.9   57.1    0.0     0   42.9
<oh>        6   50.0   50.0    0.0     0   50.0
<of>       11   45.4   36.4   18.2     0   45.4
<or>        2    0.0  100.0    0.0     0    0.0

Table C3. Phoneme recognition performance for Speaker 3.


Phoneme Total   %Cor   %Sub   %Del   Ins   %Acc
s         408   85.8    8.6    5.6    23   80.2
z         171   80.1   12.9    7.0    24   65.5
S          93   93.5    1.1    5.4     0   93.5
f         234   87.6    7.7    4.7    10   83.3
v          90   56.7   13.3   30.0    13   42.3
T         107   74.8   11.2   14.0    10   65.4
D           1    0.0  100.0    0.0     0    0.0
h          39   66.7   17.9   15.4     2   61.6
tS         27   81.5   18.5    0.0     0   81.5
dZ         33   72.7   18.2    9.1     2   66.6
p         123   88.6    7.3    4.1    36   59.3
b          45   57.9   17.8   24.2     6   44.4
t         693   85.9    7.6    6.5    23   82.6
d         246   64.6   15.9   19.5    51   43.9
k         291   91.1    3.4    5.5    10   87.0
g         123   80.5   14.6    4.9     0   80.5
m         147   72.2   15.6   12.2    12   64.0
n         513   70.8   15.4   13.8    21   66.7
N          54   51.9   27.8   20.3     1   50.0
l         225   80.5    8.4   11.1    17   72.9
r         309   90.9    2.9    6.2     7   88.7
w         132   85.6   10.6    3.8     4   82.6
j          42   85.7    9.5    4.8     2   80.9
i         224   86.1   11.2    2.7     9   82.1
I         394   72.1   18.3    9.6    32   64.0
E         249   81.5   10.8    7.6     7   78.7
{         165   88.5   10.3    1.2     4   86.1
A         111   97.3    2.7    0.0     2   95.5
Q          56   89.3    3.6    7.1     0   89.3
O         111   91.9    3.6    4.5     1   91.0
U           9   77.8    0.0   22.2     1   66.7
u         165   75.2   18.8    6.0     5   72.2
3           9  100.0    0.0    0.0     2   77.2
@         450   62.2   16.9   20.9    40   53.3
V          88   72.7   19.3    8.0     8   63.6
eI        147   93.9    6.1    0.0     0   93.9
aI        153   91.5    5.9    2.6     2   89.2
oI          3  100.0    0.0    0.0     0  100.0
aU         48   89.6    6.3    4.1     1   87.5
@U        168   76.8   19.6    3.6     4   74.4
I@         51   92.2    7.8    0.0     1   90.2
e@          6   50.0   50.0    0.0     0   50.0
<it>       63   33.3   65.1    1.6     0   33.3
<oh>       18   44.4   55.6    0.0     0   44.4
<of>       33   39.4   54.5    6.1     0   39.4
<or>        6   33.3   66.7    0.0     0   33.3

Table C4. Phoneme recognition performance for all speakers combined.
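Table C4 appears to pool the three speakers' counts before computing percentages, rather than averaging the per-speaker percentages. The two rules agree when every speaker has the same number of tokens, but not for /V/, which has 35, 18 and 35 tokens for Speakers 1 to 3: pooling reproduces the printed 72.7% correct, whereas averaging the three %Cor values would give 69.0%. A minimal sketch under that assumption follows; the per-speaker counts are inferred from the percentages in Tables C1 to C3, since raw counts are not printed, and the names are illustrative.

```python
from statistics import fmean
from typing import NamedTuple


class Counts(NamedTuple):
    """Alignment counts for one phoneme (hypothetical: only percentages are printed)."""

    total: int
    correct: int
    substituted: int
    deleted: int
    inserted: int


def pool(rows: list[Counts]) -> Counts:
    """Add counts field by field, e.g. one phoneme across several speakers."""
    return Counts(*(sum(column) for column in zip(*rows)))


def percentages(c: Counts) -> tuple[float, float, float, int, float]:
    """(%Cor, %Sub, %Del, Ins, %Acc) computed from (pooled) counts."""

    def pct(n: int) -> float:
        return round(100.0 * n / c.total, 1)

    return (pct(c.correct), pct(c.substituted), pct(c.deleted),
            c.inserted, pct(c.correct - c.inserted))


def mean_of_percentages(rows: list[Counts]) -> float:
    """The alternative rule: average the per-speaker %Cor values."""
    return round(100.0 * fmean(r.correct / r.total for r in rows), 1)


# /V/ for Speakers 1 to 3; counts inferred from Tables C1 to C3.
V_BY_SPEAKER = [
    Counts(total=35, correct=29, substituted=5, deleted=1, inserted=2),
    Counts(total=18, correct=9, substituted=6, deleted=3, inserted=4),
    Counts(total=35, correct=26, substituted=6, deleted=3, inserted=2),
]

if __name__ == "__main__":
    pooled = pool(V_BY_SPEAKER)
    print(pooled)                             # Counts(total=88, correct=64, substituted=17, deleted=7, inserted=8)
    print(percentages(pooled))                # (72.7, 19.3, 8.0, 8, 63.6)
    print(mean_of_percentages(V_BY_SPEAKER))  # 69.0
```

The pooled figures match the /V/ row of Table C4 in every column, which is why pooling seems the more likely rule.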




Class      %Cor   %Sub   %Del    Ins   %Acc   Total
Plosive    81.1    7.7   11.2     62   65.8     507
Affricate  70.0   25.0    5.0      0   70.0      20
StrFric    84.4   10.3    5.4     20   75.4     224
WkFric     75.8    9.6   14.6     12   68.2     157
Liq/Glide  84.3    5.5   10.2     12   79.2     236
Nasal      51.7   26.9   21.4      3   50.4     238
Vowel      76.0   13.5   10.5     44   71.0     868
Average    74.8   14.1   11.2   21.9   68.6    2250

Table C5. Manner results for Speaker 1.
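The manner-of-articulation tables appear to pool the counts of the phonemes in each class in the same way. For Speaker 1, the Nasal total of 238 is 49 + 171 + 18 tokens of /m/, /n/ and /N/, and the counts implied by Table C1 reproduce the Nasal %Cor, %Sub and %Del of Table C5 (51.7, 26.9 and 21.4). In the sketch below the phoneme-to-class map is a hypothetical excerpt, and the counts are inferred from the percentages in Table C1 rather than taken from the memorandum.

```python
from collections import defaultdict
from typing import NamedTuple


class Counts(NamedTuple):
    total: int
    correct: int
    substituted: int
    deleted: int
    inserted: int


# Hypothetical excerpt of a phoneme -> manner-class map (SAMPA symbols).
MANNER = {"m": "Nasal", "n": "Nasal", "N": "Nasal", "tS": "Affricate", "dZ": "Affricate"}


def class_rows(per_phoneme: dict[str, Counts], classes: dict[str, str]) -> dict[str, Counts]:
    """Pool the counts of all phonemes that share a class into one row per class."""
    sums: dict[str, list[int]] = defaultdict(lambda: [0] * len(Counts._fields))
    for phoneme, counts in per_phoneme.items():
        row = sums[classes[phoneme]]
        for i, value in enumerate(counts):
            row[i] += value
    return {name: Counts(*values) for name, values in sums.items()}


def cor_sub_del(c: Counts) -> tuple[float, ...]:
    """%Cor, %Sub and %Del of a (pooled) row, rounded to 0.1."""
    return tuple(round(100.0 * n / c.total, 1) for n in (c.correct, c.substituted, c.deleted))


# Speaker 1 counts (total, correct, substituted, deleted, inserted) inferred from Table C1.
SPEAKER1 = {
    "m": Counts(49, 19, 16, 14, 2),
    "n": Counts(171, 94, 43, 34, 1),
    "N": Counts(18, 10, 5, 3, 0),
    "tS": Counts(9, 6, 3, 0, 0),
    "dZ": Counts(11, 8, 2, 1, 0),
}

if __name__ == "__main__":
    rows = class_rows(SPEAKER1, MANNER)
    print(rows["Nasal"])                  # Counts(total=238, correct=123, substituted=64, deleted=51, inserted=3)
    print(cor_sub_del(rows["Nasal"]))     # (51.7, 26.9, 21.4)
    print(cor_sub_del(rows["Affricate"])) # (70.0, 25.0, 5.0)
```

The Affricate row of Table C5 (70.0, 25.0, 5.0) is reproduced in the same way from /tS/ and /dZ/.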

Class      %Cor   %Sub   %Del    Ins   %Acc   Total
Plosive    82.2    8.3    9.5     30   76.3     507
Affricate  85.0   15.0    0.0      2   75.0      20
StrFric    84.8    9.4    5.8     20   75.9     224
WkFric     81.5    8.9    9.6     13   73.2     157
Liq/Glide  89.9    5.9    4.2      6   87.3     236
Nasal      79.0   10.1   10.9     19   71.0     238
Vowel      82.0   12.5    5.5     30   78.6     868
Average    83.5   10.0    6.5   17.1   76.7    2250

Table C6. Manner results for Speaker 2.

Class      %Cor   %Sub   %Del    Ins   %Acc   Total
Plosive    83.8   11.1    5.1     34   77.1     507
Affricate  75.0   15.0   10.0      0   75.0      20
StrFric    87.0    6.3    6.7      7   83.9     224
WkFric     73.2   13.4   13.4     10   66.9     157
Liq/Glide  84.7    8.1    7.2     12   79.9     236
Nasal      78.2   12.2    9.6      5   76.1     238
Vowel      80.5   12.9    6.6     45   75.3     871
Average    80.3   11.3    8.4   16.1   76.3    2253

Table C7. Manner results for Speaker 3.

Class      %Cor   %Sub   %Del    Ins   %Acc   Total
Plosive    82.4    9.0    8.6    126   74.1    1521
Affricate  76.7   18.3    5.0      2   73.3      60
StrFric    85.4    8.6    6.0     47   78.4     672
WkFric     76.9   10.6   12.5     35   69.4     471
Liq/Glide  86.3    6.5    7.2     30   82.1     708
Nasal      69.6   16.4   14.0     27   65.8     714
Vowel      79.5   12.9    7.6    119   75.0    2607
Average    79.5   11.8    8.7   55.1   74.0    6753

Table C8. Manner results for all speakers.


Class      %Cor   %Sub   %Del    Ins   %Acc   Total
Labial     72.4   13.3   14.3     30   62.1     293
Alveolar   76.4   11.2   12.4     72   68.0     855
Pal-Al     83.1    9.2    7.7      2   80.0      65
Velar      80.5   10.7    8.8      5   77.5     169
Front      77.0   15.3    7.7     25   69.6     339
Central    61.7   14.4   23.9     12   55.3     188
Back       82.6    8.7    8.7      4   79.9     149
Fronting   88.1    7.9    4.0      0   88.1     101
Backing    75.0   20.8    4.2      3   70.8      72
Centring   89.5   10.5    0.0      0   89.5      19
Average    78.6   12.2    9.1   15.3   74.1    2250

Table C9. Place results for Speaker 1.

Class      %Cor   %Sub   %Del    Ins   %Acc   Total
Labial     83.3    8.5    8.2     34   71.7     293
Alveolar   82.9    8.7    8.4     49   77.2     855
Pal-Al     89.2    7.7    3.1      2   86.2      65
Velar      83.4    8.3    8.3      5   80.5     169
Front      80.6   14.8    4.6      8   78.3     345
Central    64.9   19.3   15.8     15   56.1     171
Back       91.9    6.3    1.8      3   90.0     160
Fronting   97.0    3.0    0.0      2   95.0     101
Backing    83.3   11.1    5.6      1   81.9      72
Centring   94.7    5.3    0.0      1   89.5      19
Average    85.1    9.3    5.6   12.0   80.5    2250

Table C10. Place results for Speaker 2.

Class      %Cor   %Sub   %Del    Ins   %Acc   Total
Labial     79.9   11.3    8.8     27   70.6     293
Alveolar   82.3   10.1    7.6     38   77.9     855
Pal-Al     87.7    7.7    4.6      0   87.7      65
Velar      83.4   10.7    5.9      3   81.7     169
Front      82.5   10.9    6.6     19   77.0     348
Central    67.0   17.6   15.4     23   54.8     188
Back       84.6   11.9    3.5      2   83.2     143
Fronting   93.1    6.9    0.0      0   93.1     101
Backing    80.6   18.1    1.3      1   79.2      72
Centring   78.9   21.1    0.0      0   78.9      19
Average    82.0   12.6    5.4   11.3   78.4    2253

Table C11. Place results for Speaker 3.
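The Average rows of the manner and place tables appear to be unweighted means of the class rows rather than pooled percentages. For Speaker 3, the ten %Del values of Table C11 average to the printed 5.4, whereas weighting each class by its total would give 7.2, because the two largest classes (Labial and Alveolar, about half of the 2253 tokens) have above-average deletion rates. A sketch of both calculations, assuming that reading of the Average row; the names are illustrative:

```python
from statistics import fmean

# %Del column and class totals of Table C11 (Speaker 3), rows Labial ... Centring.
DEL_SPEAKER3 = [8.8, 7.6, 4.6, 5.9, 6.6, 15.4, 3.5, 0.0, 1.3, 0.0]
TOTALS_SPEAKER3 = [293, 855, 65, 169, 348, 188, 143, 101, 72, 19]


def unweighted_average(values: list[float]) -> float:
    """Mean of the class rows: every class counts once, whatever its size."""
    return round(fmean(values), 1)


def weighted_average(values: list[float], weights: list[int]) -> float:
    """Token-weighted alternative, equivalent to pooling all classes."""
    return round(sum(v * w for v, w in zip(values, weights)) / sum(weights), 1)


if __name__ == "__main__":
    print(unweighted_average(DEL_SPEAKER3))                 # 5.4
    print(weighted_average(DEL_SPEAKER3, TOTALS_SPEAKER3))  # 7.2
```

Under this reading, small classes such as Centring (19 tokens) carry as much weight in the Average row as Alveolar (855 tokens).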

Class      %Cor   %Sub   %Del    Ins   %Acc   Total
Labial     78.5   11.0   10.5     91   68.1     879
Alveolar   80.5   10.0    9.5    159   74.3    2565
Pal-Al     86.7    8.2    5.1      4   84.6     195
Velar      82.4    9.9    7.7     13   79.9     507
Front      80.0   13.7    6.3     52   75.0    1032
Central    64.5   17.1   18.4     50   55.4     547
Back       86.6    8.8    4.6      9   84.5     452
Fronting   92.8    5.9    1.3      2   92.1     303
Backing    79.6   16.7    3.7      5   77.3     216
Centring   87.7   12.3    0.0      1   86.0      57
Average    81.9   11.5    6.7   38.6   77.7    6753

Table C12. Place results for all speakers.


Table C13. Confusion matrix for Speaker 1.

Table C16. Confusion matrix for all speakers.


% Recognised


Plo

Aff

SF

WF

L/G

Nas 

Vow 

Total 

Plosive 

84.8 

0.4 

1.4 

0.4 

0.2 

1.4 

507 

s 

Affricate 

5.0 

70.0 

15.0 

5.0

20 

p 

StrFric 

17 

0.9 

90.6 

04 

224 

0 

WkFric 

3.8 

0.6 

0.6 

77.1 

23 

157 

k 

Liq/Glide 

1.3 

0.4 

84.3

3.8 

236 

e 

Nasal 

8.0 

0.8 

4.6 

603 

41 

238 

n 

Vowel 

0.9 

OJ 

0.2 

0.6 

0.2 

871 

868 

Table C17. Confusion matrix for manner of articulation - Speaker 1.


% Recognised


Plo

Aff

SF

WF

L/G

Nas 

Vow 

Total 

Plosive 

87.8 

0.6 

0.8 

0.6 

0.4 

0.4 

507 

s 

Affricate 

95.0

5.0 

20 

p 

StrFric 

913 

21 

0.4 

224 

0 

Wk  Fric 

6.4 

83.4 

0.6 

157 

k 

Liq/Glide 

0.4 

1.3 

903 

3.8 

236 

e 

Nasal 

0.4 

0.4 

853 

2.9 

238 

n 

Vowel 

0.7 

0.1 

0.1 

0.5 

0.8 

91.0 

868 

Table C18. Confusion matrix for manner of articulation - Speaker 2.


% Recognised


Plo

Aff 

SF 

WF 

L/G 

Nas 

Vow 

Total 

Plosive 

89.7 

0.6 

On 

1.0 

1.0 

0.6 

1.2 

507 

s 

Affricate 

5.0 

80.0 

5.0 

20 

p 

StrFric 

1.8 

0.4 

90.2 

0.4 

0.4 

224 

0 

WkFric 

5.7 

0.6 

79.6 

0.6 

157 

k 

Liq/Glide 

0.4 

0.4 

873 

1.3 

3.4 

236 

e 

Nasal 

0.4 

85.7 

4.2 

238 

n 

Vowel 

01 

01 

0.8 

03 

91.8 

868 

Table C19. Confusion matrix for manner of articulation - Speaker 3.


% Recognised


Plo

Aff

SF

WF

L/G

Nas 

Vow 

Total 

Plosive 

87.4 

0.3 

0.8 

0.7 

0.6 

0.3 

1.0 

1521 

s 

Affricate 

3.3 

81.7 

6.7 

1.7 

1.7 

60 

p 

StrFric 

1.5 

0.4 

90.8 

03 

0.7 

0.3 

672 

0 

WkFric 

5.3 

01 

0.4 

80.0 

13 

471 

k 

Liq/Glide 

0.7 

0.1 

0.6 

873 

04 

3.7 

708 

c 

Nasal 

2.2 

03 

1.8 

771 

3.8 

714 

n 

Vowel 

0.6 

. 

0.1 

0.2 

0.6 

0.5 

89.9 

2607 

Table  C20.  Confusion  matrix  for  manner  of  articulation  -  all  speakers 
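Tables C17 to C24 appear to collapse the phoneme confusions into manner and place classes, each cell giving the percentage of a spoken class recognised as a given class; each row falls short of 100% by roughly the class deletion rate. The sketch below shows one way such a collapse could be carried out. The function name, class labels and toy counts are hypothetical and are not taken from the memorandum.

```python
from collections import Counter


def collapse_confusions(
    confusions: dict[tuple[str, str], int],
    classes: dict[str, str],
    spoken_totals: dict[str, int],
) -> dict[str, dict[str, float]]:
    """Class-level confusion matrix, each row as % of the spoken class total.

    confusions maps (spoken phoneme, recognised phoneme) -> count. Deleted
    tokens have no recognised phoneme, so they are simply absent and a row
    can sum to less than 100%.
    """
    cells: Counter = Counter()
    for (spoken, recognised), n in confusions.items():
        cells[classes[spoken], classes[recognised]] += n

    row_totals: Counter = Counter()
    for phoneme, n in spoken_totals.items():
        row_totals[classes[phoneme]] += n

    return {
        row: {
            col: round(100.0 * n / row_totals[row], 1)
            for (r, col), n in sorted(cells.items())
            if r == row
        }
        for row in row_totals
    }


# Toy data (hypothetical): two plosives and two nasals.
CLASSES = {"p": "Plo", "t": "Plo", "m": "Nas", "n": "Nas"}
TOTALS = {"p": 10, "t": 10, "m": 5, "n": 5}
CONFUSIONS = {
    ("p", "p"): 8, ("p", "t"): 1, ("t", "t"): 9,  # one /p/ and one /t/ deleted
    ("m", "m"): 3, ("m", "n"): 1, ("n", "n"): 4,  # one /m/ deleted
    ("n", "t"): 1,                                # a nasal heard as a plosive
}

if __name__ == "__main__":
    print(collapse_confusions(CONFUSIONS, CLASSES, TOTALS))
    # {'Plo': {'Plo': 90.0}, 'Nas': {'Nas': 80.0, 'Plo': 10.0}}
```

Because a confusion between two phonemes of the same class (here /m/ heard as /n/) lands on the diagonal, a diagonal cell can exceed the class %Cor: in the toy data the nasals are 70% correct but 80% are recognised as nasals. This would be consistent with the largest entry in the Nasal row of Table C17 (about 60%) exceeding the 51.7% Nasal %Cor of Table C5.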


Lab

Alv

P-A

Vel

% Recognised

Frn  Bck

Cen

F'g

B'g

C'g 

Total 

Labial 

82.3 

03 

0.3 

0.3 

03 

1.7 

03 

03 

293 

Alveolar 

0.9 

82.2 

0.2 

0.4 

0.5 

03 

1.1 

0.1 

0.1 

0.1

855 

s 

Pal-Alv 

3.1 

90.8 

65 

p 

Velar 

1.8

2.4 

1.8 

828 

03 

03 

12 

169 

0 

From 

81.7 

23 

4.1 

13 

0.9 

0.9 

339 

k 

Back 

0.7 

4.0 

82.6 

13 

13 

149 

e 

Central

32 

1.6 

0.5 

0.5 

3.2 

1.1 

633 

03 

1.1 

03 

188 

n 

Fronting 

4.0 

2.0 

89.1 

1.0 

101 

Backing 

4.2 

14 

6.9 

5.6 

750 

1.4 

72 

Centring 

53 

94.7 

19 

Table C21. Confusion matrix for place of articulation - Speaker 1.


% Recognised


Lab

Alv

P-A

Vel

Frn

Bck

Cen

F'g

B'g

C'g 

Total 

Labial 

87.4 

3.4 

0.7 

03 

293 

Alveolar 

2.6 

86.7 

0.1 

04 

02 

0.8 

0.1

0.1

855 

s 

Pal-Alv 

923 

3.1 

13 

65 

p 

Velar 

0.6 

3.6 

85.7 

0.6 

03 

0.6 

169 

o 

Front 

0.3 

2.3 

1.2 

84.3 

23 

32 

0.9 

03 

345 

k 

Back 

0.6 

1.3 

91.9 

0.6 

0.6 

160 

e 

Central

2.9 

53 

2.9 

690 

12 

0.6 

171 

n 

Fronting 

2.0 

1.0 

97.0 

101 

Backing 

14 

1.4 

6.9 

1.4 

833 

72 

Centring 

53 

94.7 

19 

Table  C22.  Confusion  matrix  for  place  of  articulation  -  Speaker  2. 


% Recognised


Lab  Alv  P-A  Vel  Frn  Bck  Cen  F'g  B'g  C'g  Total


Labial 

88.7 

4.1 

0.3 

0.3 

1.0 

293 

Alveolar 

2.1 

87.1 

0.6 

0.1 

0.1 

0.1 

1.9 

0.2 

855 

S 

Pal-Alv 

90.8 

1.5 

13 

65 

P 

Velar 

2.4 

7.1 

84.6 

169 

o 

Front 

90.2 

1.4 

0.6 

0.6  0.6 

348 

k 

Back 

1.4 

1.4 

4.2 

84.6 

2.1 

2.1 

143 

e 

Central 

32 

1.6 

7.8 

0.5 

69.1 

1.1 

1.1 

188 

n 

Fronting 

5.0 

93.1 

2.0 

101 

Backing 

4.2 

9.7 

1.4 

80.6 

72 

Centring 

15.8 

5.3 

78.9 

19 

Table  C23.  Confusion  matrix  for  place  of  articulation  -  Speaker  3. 

Lab

Alv

P-A

Vel

% Recognised

Frn  Bck

Cen

F'g

B'g

C'g

Total 

Labial 

86.1 

2.6 

0.1 

05 

02 

1.0 

01 

01 

879 

Alveolar 

1.9 

85.4 

0.3 

0.2 

03 

0.2 

12 

02 

0.1 

2565 

s 

Pal-Alv 

1.0 

91.3 

1.5 

03 

03 

195 

p 

Velar 

13 

43 

0.6 

844 

0.4 

02 

06 

0.2 

5.7 

o 

Front 

0.1 

0.8 

0.4 

853 

13 

2.9 

1.0 

03 

03 

1032 

k 

Back 

0.6 

1.1 

2.7 

863 

1.1 

02 

13 

547 

e 

Central 

22 

2.0 

02 

0.2 

5.7 

13 

67.1 

03 

1.1 

04 

452 

n 

Fronting 

3.0 

0.7 

1.0 

93.1 

0.7 

03 

303 

Backing 

03 

18 

0.9 

7.9 

7.8

79.6 

03 

216 

Centring 

8.8 

1.8 

893 

57 

Table  C24.  Confusion  matrix  for  place  of  articulation  -  all  speakers. 